Abstract

We examine Euclidean distance preserving data perturbation as a tool for privacy-preserving data 
mining. Such perturbations allow many important data mining algorithms, with only minor modification, 
to be applied to the perturbed data and produce exactly the same results as if applied to the original data, 
e.g. hierarchical clustering and k-means clustering. However, the issue of how well the original data is 
hidden needs careful study. We take a step in this direction by assuming the role of an attacker armed 
with two types of prior information regarding the original data. We examine how well the attacker can 
recover the original data from the perturbed data and prior information. Our results offer insight into 
the vulnerabilities of Euclidean distance preserving transformations. 

Index Terms 

Euclidean distance preservation, privacy-preserving data mining, principal component analysis 

I. Introduction 

Recent interest in the collection and monitoring of data using data mining technology for the 
purpose of security and business-related applications has raised serious concerns about privacy 
issues. For example, mining health-care data for security/fraud issues may require analyzing 
clinical records and pharmacy transaction data of many individuals over a certain area. However, 
releasing and gathering such diverse information belonging to different parties may violate 
privacy laws and eventually be a threat to civil liberties. Privacy-Preserving Data Mining (PPDM) 
strives to provide a solution to this dilemma. It aims to allow useful data patterns to be extracted 
without compromising privacy. 

Data perturbation represents one common approach in PPDM. Here, the original private dataset 
X is perturbed and the resulting dataset Y is released for analysis. Perturbation approaches 
typically face a "privacy/accuracy" trade-off. On the one hand, perturbation must not allow the 
original data records to be adequately recovered. On the other, it must allow "patterns" in the 
original data to be recovered. In many cases, increased privacy comes at the cost of reduced 
accuracy and vice versa. For example, Agrawal and Srikant [1] proposed adding randomly 
generated i.i.d. noise to the dataset. They showed how the distribution from which the original 
data arose can be estimated using only the perturbed data and the distribution of the noise. 
However, Kargupta et al. [2] and Huang et al. [3] pointed out how, in many cases, the noise can 
be filtered off, leaving a reasonably good estimate of the original data (further investigated
by Guo et al. [4]). These results indicate that unless the variance of the additive
noise is sufficiently large, original data records can be recovered unacceptably well. However,
increasing the variance reduces the accuracy with which the original data distribution can be
estimated. This privacy/accuracy trade-off is not limited to additive noise; some other perturbation
techniques suffer from a similar problem, e.g. k-anonymity [5].

Recently, Euclidean distance preserving data perturbation for the census model¹ has gained
attention ([7]-[13]) because it mitigates the privacy/accuracy trade-off by guaranteeing perfect
accuracy. The census model using Euclidean distance preserving data perturbation can be
illustrated as follows. An organization has a private, real-valued dataset X (represented as a matrix
where each column is a data record) and wishes to make it publicly available for data analysis
while keeping the individual records (columns) private. To accomplish this, Y = T(X) is released
to the public, where T(.) is a function, known only to the data owner, that preserves Euclidean
distances between columns. Because of this property, many useful data mining algorithms, with
only minor modification, can be applied to Y and produce exactly the same patterns that
would be extracted if the algorithm were applied directly to X. For example, assume single-link,
agglomerative hierarchical clustering (using Euclidean distance) is applied directly to Y.
The cluster memberships in the resulting dendrogram will be identical to those in the dendrogram
produced if the same algorithm is applied to X.

However, the issue of how well the private data is hidden after Euclidean distance preserving 
data perturbation needs careful study. Without any prior knowledge, the attacker can do very 
little (if anything) to accurately recover the private data. However, no prior knowledge seems 
an unreasonable assumption in many situations. Consideration of prior knowledge based attack 
techniques against Euclidean distance preserving transformations is an important avenue of study. 
In this paper, we take a step in this direction by considering two types of prior knowledge and, 
for each, develop an attack technique by which the attacker can estimate private data in X from 
the perturbed data Y. 

1 The census model is widely studied in the field of security control for statistical databases [6].
A. Attacker Prior Knowledge Assumptions

1) Known input: The attacker knows some small collection of private data records. 

2) Known sample: The attacker knows that the private data records arose as independent
samples of some n-dimensional random vector V with unknown p.d.f. In addition, the attacker has
another collection of independent samples from V.² For technical reasons, we make a mild
additional assumption that holds in many practical situations [14, pg. 27]: the covariance
matrix of V has distinct eigenvalues.

It is important to stress that we do not assume that both hold simultaneously. Rather, we consider 
each assumption separately and develop two separate attacks. 

Regarding the known input assumption, as pointed out in [13], this knowledge could be 
obtained through insider information. For example, consider a dataset where each record
corresponds to information about an individual (e.g. medical data, census data). It is reasonable to
assume that the individuals know (1) that a record for themselves appears in the dataset, and (2) 
the attributes of the dataset. As such, each individual knows one private record in the original 
dataset. A small group of malicious individuals could then combine their insider information to 
produce a larger known input set. As we will show through experiments, a set of four known 
inputs is enough to breach the privacy of another unknown input on a 16-dimensional, real 
dataset. 

Regarding the known sample assumption, as pointed out in [13], this knowledge could be
obtained through an insider with access to a competitor's dataset. For example, consider a pair
of competing companies offering a very similar service to the same population (e.g. insurance
companies). These companies each store information about individuals from the population
in a dataset, one record per individual. It is reasonable to assume that (1) the records from
each company's dataset are drawn independently from the same underlying distribution, (2)
each company collects the same or a heavily overlapping set of attributes (perhaps after some
derivation), and (3) if one company releases a perturbed dataset, the other knows the attributes
of that dataset. As such, a malicious insider at the other company has a known sample.

2 These samples are not assumed to have been drawn from the original dataset X; rather, they are drawn
independently from the same distribution as X.
B. Results Summary

The first attack technique we develop is called the known input attack. The attacker is assumed
to have known input prior knowledge and proceeds as follows. (1) The attacker links as many as
possible of the known private tuples (inputs) to their corresponding columns in Y (outputs). (2) The
attacker chooses a Euclidean distance preserving transformation uniformly from the space of
such transformations that satisfy these input-output constraints. Based on the links established
in step 1, we develop a closed-form expression for the privacy breach probability. Experiments
on real and synthetic data indicate that even with a small number of known inputs, the attack
can achieve a high privacy breach probability.

The second attack technique we develop is called the known sample attack. The attacker is
assumed to have known sample prior knowledge. The attacker uses the relationship between the
eigenvectors of the perturbed data and the known sample data to estimate the private data, X,
from the public perturbed data, Y. On real and synthetic data, we empirically study this attack
and observe decreasing accuracy in three cases: (1) as the known sample size decreases, (2) as
the separation between the eigenvalues of $\Sigma_V$ (the covariance matrix of V) decreases, and (3) as
certain types of symmetries become more pronounced in the p.d.f. of V. The quality decrease
in the first two cases is due to the fact that the eigenstates of V are difficult to estimate well.
The quality decrease in the third case is due to inherent ambiguity present in the eigenstates of
V, namely, they are determined up to mirror flips of the normalized eigenvectors.

II. Related Work 

This section presents a brief overview of the literature on data perturbation for PPDM. There 
is another class of PPDM techniques using secure multi-party computation (SMC) protocols for 
implementing common data mining algorithms across distributed datasets. We refer interested 
readers to [15] for more details. 

Additive perturbation: Adding i.i.d. white noise to protect data privacy is one common approach 
for statistical disclosure control [6]. The perturbed data allows the retrieval of aggregate statistics 
of the original data (e.g. sample mean and variance) without disclosing values of individual 
records. Moreover, additive white noise perturbation has received attention in the data mining 
literature from the perspective described at the beginning of Section I. Clearly, additive noise
does not preserve Euclidean distance perfectly. However, it can be shown that additive noise
preserves the squared Euclidean distance between data tuples in expectation, but the associated
variance is large.³ We defer the details of this analysis to future work and do not consider
additive noise further in this paper.

Multiplicative perturbation: Two traditional multiplicative data perturbation schemes were 
studied in the statistics community [16]. One multiplies each data element by a random number 
that has a truncated Gaussian distribution with mean one and small variance. The other takes 
a logarithmic transformation of the data first, adds multivariate Gaussian noise, then takes the 
exponential function exp(.) of the noise-added data. These perturbations allow summary statistics 
(e.g., mean, variance) of the attributes to be estimated, but do not preserve Euclidean distances 
among records. 

To assess the security of traditional multiplicative perturbation together with additive
perturbation, Trottini et al. [17] proposed a Bayesian intruder model that considers both prior and
posterior knowledge of the data. Their overall strategy of attacking the privacy of perturbed 
data using prior knowledge is the same as ours. However, they particularly focused on linkage 
privacy breaches, where an intruder tries to identify the identity (of a person) linked to a specific 
record; while we are primarily interested in data record recovery. Moreover, they did not consider 
Euclidean distance preserving perturbation as we do. 

Data anonymization: Samarati and Sweeney [5], [18] developed the k-anonymity framework 
wherein the original data is perturbed so that the information for any individual cannot be 
distinguished from at least k-1 others. Values from the original data are generalized (replaced 
by a less specific value) to produce the anonymized data. This framework has drawn considerable
attention because of its simple privacy definition. A variety of refinements have been proposed;
see the discussions of k-anonymity in various chapters of [19]. None of these approaches considers
Euclidean distance preserving perturbation as we do.

Data micro-aggregation: Two multivariate micro-aggregation approaches have been proposed 
by researchers in the data mining area. The technique presented by Aggarwal and Yu [20] 
partitions the original data into multiple groups of predefined size. For each group, a certain level 
of statistical information (e.g., mean and covariance) is maintained. This statistical information is
used to create anonymized data that has similar statistical characteristics to the original dataset. 



3 To our knowledge, such observations have not been made before. 
Li et al. [21] proposed a kd-tree based perturbation method, which recursively partitions a dataset
into subsets which are progressively more homogeneous after each partition. The private data
in each subset is then perturbed using the subset average. The relationships between attributes
are argued to be preserved reasonably well. Neither of these two approaches preserves Euclidean
distance between the original data tuples.

Data swapping and shuffling: Data swapping perturbs the dataset by switching a subset of 
attributes between selected pairs of records so that the individual record entries are unmatched, 
but the statistics are maintained across the individual fields. A variety of refinements and 
applications of data swapping have been addressed since its initial appearance. We refer readers 
to [22] for a thorough treatment. Data shuffling [23] is similar to swapping, but is argued 
to improve upon many of the shortcomings of swapping for numeric data. However, neither
swapping nor shuffling preserves Euclidean distance, which is the focus of this paper.

Some other data perturbation techniques: Evfimievski et al. [24] and Rizvi and Haritsa [25]
considered the use of categorical data perturbation in the context of association rule mining. 
Their algorithms delete real items and add bogus items to the original records. Association 
rules present in the original data can be estimated from the perturbed data. Along a related 
line, Verykios et al. [26] considered perturbation techniques which allow the discovery of some 
association rules while hiding others considered to be sensitive. 

A. Most Related Work 

In this part, we describe the research most closely related to this paper. Most of it focuses on
Euclidean distance preserving data perturbation.

Oliveira and Zaiane [8], [9] and Chen and Liu [7] discussed the use of geometric rotation for
clustering and classification. These authors observed that the distance preserving nature of 
rotation makes it useful in PPDM, but did not analyze its privacy limitations, nor did they 
consider prior knowledge. 

Chen et al. [12] also discussed a known input attack technique. Unlike ours, they considered 
a combination of distance preserving data perturbation followed by additive noise. And, they 
assumed a stronger form of known input prior knowledge: the attacker knows a subset of private 
data records and knows to which perturbed tuples they correspond. Finally, they assume that the 
number of linearly independent known input data records is no smaller than n (the dimensionality
of the records). They pointed out that linear regression can be used to re-estimate private data
tuples. 

Mukherjee et al. [11] considered the use of discrete Fourier transformation (DFT) and discrete 
cosine transformation (DCT) to perturb the data. Only the high energy DFT/DCT coefficients are 
used, and the transformed data in the new domain approximately preserves Euclidean distance. 
The DFT/DCT coefficients were further permuted to enhance the privacy protection level. Note 
that DFT and DCT are (complex) orthogonal transforms. Hence their perturbation technique 
can be expressed as left multiplication by a (complex) orthogonal matrix (corresponding to the 
DFT/DCT followed by a permutation of the resulting coefficients), then a left multiplication
by an identity matrix with some zeros on the diagonal (corresponding to dropping all but the 
high-energy coefficients). They did not consider attacks based on prior knowledge. As future 
work, it would be interesting to do so. 

Turgay et al. [13] extended some of the results in our conference version of this work [10]. 
They assume that the similarity matrix of the original data is made public rather than, Y, the 
perturbed data itself. They describe how an attacker, given at least n + 1 linearly independent 
original data tuples and their corresponding entries in the similarity matrix, can recover the 
private data. Like that of Chen et al., this attack differs from our known input attack in two main ways:
(i) we do not require prior knowledge beyond the known input tuples; (ii) our attack analysis 
smoothly encompasses the case where the number of linearly independent known input tuples is 
greater than n as well as less. Turgay et al. also describe how an attacker, given the underlying 
probability distribution of the original data, can use PCA to re-estimate the original data. This 
approach is based on ours in [10], with the following differences. First, they assume that the 
global distribution of the private data is known, but we only assume a small sample drawn 
from the same distribution is known. Second, they use a simple (and clever) heuristic to find 
the best eigenvector mirror directions while we use a complete, enumerative search. While their 
approach has only linear computational complexity with respect to data dimensionality and ours
is exponential, their approach will not produce as good an eigenvector matching as ours. It is an
interesting direction for future work to explore empirically and analytically how well the results 
of their heuristic search fare against the results of our complete, enumerative search. 

Ting et al. [27] considered left-multiplication by a randomly generated orthogonal matrix. 
However, they assume the original data tuples are rows rather than columns as we do. As a result, 
Euclidean distance between original data tuples is not preserved, but sample mean and covariance
are. If the original data arose as independent samples from a multivariate Gaussian distribution,
then the perturbed data allows inferences to be drawn about this underlying distribution just as 
well as the original data. For all but small or very high-dimensional datasets, their approach 
is more resistant to prior knowledge attacks than Euclidean distance preserving perturbations. 
Their perturbation matrix is m x m (m the number of original data tuples), much bigger than 
Euclidean distance preserving perturbation matrices, n x n (n the number of entries in each 
original data tuple). 

Mukherjee et al. [28] considered adding noise to the most dominant principal components of
the dataset along with a modification of k-nearest-neighbor classification on the perturbed data
to improve accuracy. Moreover, they nicely extend to additive noise the $\rho_1$-to-$\rho_2$ privacy breach
measure originally introduced for categorical data in [24]. Their approach, however, does not
preserve Euclidean distance, and thus is fundamentally different from the perturbation techniques we
consider.

Before we briefly describe another two attacks based on independent component analysis 
(ICA) [29], it is necessary to give a brief ICA overview. 

1) ICA Overview: Given an n'-variate random vector V, one common ICA model posits that
this random vector was generated by a linear combination of independent random variables,
i.e., V = AS with S an n-variate random vector with independent components. Typically, S is
further assumed to satisfy the following additional assumptions: (i) at most one component is
distributed as a Gaussian; (ii) $n' \ge n$; and (iii) A has rank n.

One common scenario in practice: there is a set of unobserved samples (the columns of the n x q
matrix S) that arose from S, which satisfies (i) - (iii) and whose components are independent.
What is observed is the n' x q matrix V, whose columns arose as a linear combination of the rows of
S. The columns of V can be thought of as samples that arose from a random vector V which
satisfies the above generative model. There are ICA algorithms whose goal is to recover S and
A from V up to a row permutation and constant multiple. This ambiguity is inevitable due to
the fact that for any diagonal matrix D (with all non-zeros on the diagonal) and permutation
matrix P, if (A, S) is a solution, then so is $(ADP,\ P^{-1}D^{-1}S)$.
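
The ambiguity can be checked in one line. Assuming only that D is an invertible diagonal matrix and P a permutation matrix (so both inverses exist), the rescaled and permuted pair reproduces exactly the same observed matrix V:

```latex
\[
(ADP)\,(P^{-1}D^{-1}S)
  \;=\; A\,D\,(P P^{-1})\,D^{-1}\,S
  \;=\; A\,(D D^{-1})\,S
  \;=\; AS
  \;=\; V .
\]
```

Since V alone cannot distinguish the two factorizations, no ICA algorithm can resolve the ordering or the scale of the rows of S without further information.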

2) ICA Based Attacks: Liu et al. [30] considered matrix multiplicative data perturbation,
Y = MX, where M is an n' x n matrix with each entry generated independently from the same
distribution with mean zero and variance $\sigma^2$. They discussed the application of the above ICA
approach to estimate X directly from Y, taking S = X, V = Y, and A = M. They
argued the approach to be problematic because the ICA generative model imposes assumptions
not likely to hold in many practical situations: the components of X are independent with at
most one such being Gaussian distributed. Moreover, they pointed out that the row permutation
and constant multiple ambiguity further hampers accurate recovery of X. A similar observation
is made later by Chen et al. [12].

Guo and Wu [31] considered matrix multiplicative perturbation assuming only that M is
an n x n matrix (orthogonal or otherwise). They assumed the attacker has known input prior
knowledge, i.e. she knows $\tilde{X}$, a collection of original data columns from X. They developed
an ICA-based attack technique for estimating the remaining columns in X. To avoid the ICA
problems described in the previous paragraph, they instead applied ICA separately to $\tilde{X}$ and
Y, producing representations $(A_{\tilde{X}}, S_{\tilde{X}})$ and $(A_Y, S_Y)$. They argued that these representations
are related in a natural way, allowing X to be estimated. Their approach is similar in spirit
to our known sample attack, which relates S and Y through representations derived through
eigen-analysis.

III. Euclidean Distance Preserving Perturbation and Privacy Breaches 

This section defines T, a Euclidean distance preserving data perturbation, and then defines
a privacy breach.

A. Notation and Conventions 

Throughout this paper, unless otherwise stated, the following notations and conventions are 
used. "Euclidean distance preserving" and "distance preserving" are used interchangeably. All 
matrices and vectors discussed are assumed to have real entries (unless otherwise stated). All 
vectors are assumed to be column vectors and M' denotes the transpose of any matrix M. Given
a vector x, $\|x\|$ denotes its Euclidean norm. An m x n matrix M is said to be orthogonal if
$M'M = I_n$, the n x n identity matrix.⁴ The set of all n x n orthogonal matrices is denoted by $O_n$.



4 If M is square, it is orthogonal if and only if $M' = M^{-1}$ [32, pg. 17].
Given n x p and n x q matrices A and B, let [A|B] denote the n x (p + q) matrix whose
first p columns are A and last q are B. Likewise, given p x n and q x n matrices A and B, let
$\begin{bmatrix} A \\ B \end{bmatrix}$ denote the (p + q) x n matrix whose first p rows are A and last q are B.



The data owner's private dataset is represented as an n x m matrix X, with each column 
a record and each row an attribute (each record is assumed to be non-zero). The data owner 
applies a Euclidean distance preserving perturbation to X to produce an n x m data matrix Y, 
which is then released to the public or another party for analysis. That Y was produced from X 
by a Euclidean distance preserving data perturbation (but not which one) is also made public.

B. Euclidean Distance Preserving Perturbation 



A function $H : \mathbb{R}^n \to \mathbb{R}^n$ is Euclidean distance preserving if for all $x, y \in \mathbb{R}^n$,
$\|x - y\| = \|H(x) - H(y)\|$. Here H is also called a rigid motion. It has been shown that any distance
preserving function is equivalent to an orthogonal transformation followed by a translation [32,
pg. 128]. In other words, H may be specified by a pair $(M, v) \in O_n \times \mathbb{R}^n$, in that, for all
$x \in \mathbb{R}^n$, $H(x) = Mx + v$. If $v = 0$, H preserves Euclidean length: $\|x\| = \|H(x)\|$; as such, it
moves x along the surface of the hyper-sphere with radius $\|x\|$ and centered at the origin.



We do not assume that the correspondence between the columns of the perturbed dataset
T(X) = Y (denoted $y_1, \ldots, y_m$) and the columns of the private dataset X (denoted $x_1, \ldots, x_m$)
is known; i.e. the perturbed version of $x_i$ is not necessarily $y_i$. Instead, the columns of X are
transformed using a Euclidean distance preserving function, then are permuted to produce the
columns of the perturbed dataset Y. Formally, the perturbed dataset Y is produced as follows.
The private data owner chooses $(M_T, v_T)$, a secret Euclidean distance preserving function, and
$\pi$, a secret permutation of $\{1, \ldots, m\}$. Then, for $1 \le i \le m$, the data owner produces
$y_{\pi(i)} = M_T x_i + v_T$.

Euclidean distance between the private data tuples is preserved in the perturbed dataset: for
all $1 \le i, j \le m$, $\|x_i - x_j\| = \|y_{\pi(i)} - y_{\pi(j)}\|$. Moreover, if $v_T = 0$, then the length of the private
data tuples is also preserved: for all $1 \le i \le m$, $\|x_i\| = \|y_{\pi(i)}\|$.
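
As a concrete illustration of this construction, the following minimal sketch perturbs a tiny two-dimensional dataset with a rotation, a translation, and a column permutation, and then checks that pairwise distances survive. The function names (`dist`, `perturb`), the rotation angle, the translation, and the sample records are illustrative assumptions, not part of the paper; indices are 0-based in the code, whereas the text counts from 1.

```python
import math
import random

def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def perturb(records, theta, v, perm):
    """Apply y_{perm[i]} = M x_i + v, with M a 2-D rotation by theta.

    records: list of 2-D tuples (the columns of X)
    perm:    list where perm[i] is the output position of record i
    """
    c, s = math.cos(theta), math.sin(theta)
    out = [None] * len(records)
    for i, (x0, x1) in enumerate(records):
        out[perm[i]] = (c * x0 - s * x1 + v[0], s * x0 + c * x1 + v[1])
    return out

# Hypothetical private data: three records with two attributes each.
X = [(1.0, 2.0), (3.0, -1.0), (0.5, 4.0)]
rng = random.Random(0)
perm = list(range(len(X)))
rng.shuffle(perm)  # the secret permutation
Y = perturb(X, theta=0.7, v=(5.0, -2.0), perm=perm)

# Pairwise distances agree once the secret permutation is accounted for.
max_gap = max(
    abs(dist(X[i], X[j]) - dist(Y[perm[i]], Y[perm[j]]))
    for i in range(len(X))
    for j in range(len(X))
)
print(max_gap < 1e-9)
```

The check compares distances up to a small floating-point tolerance. Setting `v=(0.0, 0.0)` additionally preserves the length of each record, matching the second property stated above.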



C. Privacy Breaches

Based on the assumptions described earlier, the attacker will employ a stochastic attack
technique⁵ and produce $1 \le j \le m$ and non-zero $\hat{x} \in \mathbb{R}^n$. Here, $\hat{x}$ is an estimate of $x_{\hat{j}}$
(with $\hat{j}$ denoting $\pi^{-1}(j)$), the private data tuple that was perturbed to produce $y_j$.⁶ Given $\epsilon > 0$,
we consider three different privacy breach definitions.

Definition 3.1: 1) An $\epsilon$-privacy breach occurs if $\|\hat{x} - x_{\hat{j}}\| \le \|x_{\hat{j}}\|\epsilon$, i.e. if the attacker's
estimate is wrong with Euclidean relative error no more than $\epsilon$.

2) An $\epsilon$-MED-privacy breach (Minimum Entry Difference) occurs if $\min_{i=1}^{n} \{NAD(x_{\hat{j},i}, \hat{x}_i)\} \le \epsilon$,
where $x_{\hat{j},i}$ and $\hat{x}_i$ are the i-th entries of $x_{\hat{j}}$ and $\hat{x}$, and $NAD(a, \hat{a})$ is the normalized absolute
difference: it equals $\hat{a}$ if $a = 0$; otherwise, it equals $|a - \hat{a}|/|a|$.

3) An $\epsilon$-cos-privacy breach occurs if $1 - \cos(x_{\hat{j}}, \hat{x}) \le \epsilon$, where $\cos(w, \hat{w})$ denotes the cosine
distance.⁷

The relative Euclidean distance breach definition is inappropriate in situations where the 
accurate recovery of even one entry of a private data tuple is unacceptable to the data owner. 
The MED breach definition is intended for this situation. Moreover, the relative Euclidean 
distance breach definition is inappropriate for very high dimensional data (due to the curse 
of dimensionality) or where accurate recovery of a private data tuple up to a scaling factor is
unacceptable to the data owner. The cos breach definition is intended for these situations. 
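
To make the three definitions concrete, the following sketch evaluates each breach test on a single estimate. It is an illustration under stated assumptions rather than the paper's implementation: the function names are hypothetical, `cos` is taken to be the standard cosine of the angle between the two vectors, and the `a == 0` branch of NAD uses the absolute value of the estimated entry.

```python
import math

def norm(u):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in u))

def eps_breach(x, x_hat, eps):
    """Relative Euclidean error of the estimate is at most eps."""
    return norm([a - b for a, b in zip(x, x_hat)]) <= norm(x) * eps

def nad(a, a_hat):
    """Normalized absolute difference; the a == 0 branch uses |a_hat| (assumption)."""
    return abs(a_hat) if a == 0 else abs(a - a_hat) / abs(a)

def med_breach(x, x_hat, eps):
    """At least one entry is recovered with normalized error at most eps."""
    return min(nad(a, b) for a, b in zip(x, x_hat)) <= eps

def cos_breach(x, x_hat, eps):
    """Estimate points in nearly the same direction as x (standard cosine, an assumption)."""
    cos = sum(a * b for a, b in zip(x, x_hat)) / (norm(x) * norm(x_hat))
    return 1.0 - cos <= eps

x = (2.0, 4.0, 4.0)      # hypothetical private tuple
x_hat = (2.0, 8.0, 8.0)  # estimate with one exact entry and two poor ones
print(eps_breach(x, x_hat, 0.1), med_breach(x, x_hat, 0.1), cos_breach(x, x_hat, 0.1))
```

With these inputs the script prints `False True True`: the estimate is far from the private tuple in relative Euclidean terms, yet it recovers one entry exactly and points in almost the same direction, which is exactly the kind of case the MED and cos definitions are meant to catch.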

In the next two sections, we describe and analyze an attack technique for each type of prior
knowledge listed in Section I-A. The main focus of the analysis concerns $\rho(\epsilon)$, the probability that an
$\epsilon$-privacy breach occurred. However, we briefly discuss how the analysis can be applied to the
probability that an $\epsilon$-MED-privacy breach and $\epsilon$-cos-privacy breach occurred.

IV. Known Input Attack 

For $1 \le a \le m - 1$, let $X_a$ denote the first a columns of X. The attacker is assumed to
know $X_a$ and her attack proceeds in two steps. (1) Infer as many as possible of the input-output

5 Note that X and Y are fixed. 

6 The attacker does not need to know $\hat{j}$; she is merely producing an estimate of the private data tuple that was perturbed to
produce $y_j$.

7 Note, $0 \le \cos(w, \hat{w}) \le 1$, equaling 1 if and only if w and $\hat{w}$ differ only by a scaling factor.
mappings in $\pi_a$ (the restriction of $\pi$ to $\{1, \ldots, a\}$).⁸ (2) Using known inputs along with their
inferred outputs, produce $\hat{x}$.

The bulk of our work involves the development and analysis of an attack technique in the case
where the data perturbation is assumed to be orthogonal (does not involve a fixed translation,
$v_T = 0$). The majority of this section is dedicated to developing and analyzing an attack in this
case. Then, in Appendix H we briefly describe how the attack and analysis can be extended to
arbitrary Euclidean distance preserving perturbation ($v_T \ne 0$).

A. Inferring $\pi_a$

The attacker may not have enough information to infer $\pi_a$, so her goal is to infer $\pi_I$ (the
restriction of $\pi$ to $I \subseteq \{1, \ldots, a\}$), for as large an I as possible. Next, we describe how the
goal can be precisely stated as an algorithmic problem that the attacker can address given her
available information.

Given $I \subseteq \{1, \ldots, a\}$, an assignment on I is a 1-1 function $\alpha : I \to \{1, \ldots, m\}$. An assignment
$\alpha$ on I is valid if it satisfies both of the following conditions for all $i, j \in I$: (1) $\|x_i\| = \|y_{\alpha(i)}\|$
and (2) $\|x_i - x_j\| = \|y_{\alpha(i)} - y_{\alpha(j)}\|$. There is at least one valid assignment on I, namely $\pi_I$,
but there may be more. I is uniquely valid if $\pi_I$ is the only valid assignment on I. Given
uniquely valid $I \subseteq \{1, \ldots, a\}$, I is said to be maximal if there does not exist uniquely valid
$J \subseteq \{1, \ldots, a\}$ such that $|J| > |I|$. It can be shown that there exists only one maximal uniquely
valid subset of $\{1, \ldots, a\}$. Thus, the attacker's goal is to find the maximal uniquely valid subset
of $\{1, \ldots, a\}$ along with its corresponding assignment.

The following straightforward algorithm will meet the attacker's goal by employing a top-down,
level-wise search of the subset space of $\{1, \ldots, a\}$. The inner for-loop uses an implicit
linear ordering to enumerate the size-$\ell$ subsets without repeats and requiring O(1) space.

Algorithm IV-A.1 Overall Algorithm For Finding the Maximal Uniquely Valid Subset
1: For $\ell = a, \ldots, 1$, do
2:   For all $I \subseteq \{1, \ldots, a\}$ with $|I| = \ell$, do
3:     If I is uniquely valid, then output I along with its corresponding assignment and terminate the algorithm.
4: Otherwise output $\emptyset$.



8 That is, find as many as possible perturbed counterparts of $X_a$ in Y.
Now we develop an algorithm that, given $I \subseteq \{1, \ldots, a\}$, determines if I is uniquely valid,
and, if so, also computes the corresponding assignment. The idea is to search the space of all
assignments on I for valid ones. Once more than one valid assignment is identified, the search
is cut off and the algorithm outputs that I is not uniquely valid. Otherwise, exactly one valid
assignment, $\pi_I$, will be found. In this case, the algorithm outputs that I is uniquely valid and
returns the corresponding assignment. The algorithm performs a depth-first search employing a
simple, but effective, pruning rule to eliminate possible assignment choices at each node in the
search tree. The rationale for this pruning rule is described next.

Given $I_1 \subset I$, $\alpha_1$ a valid assignment on $I_1$, and $i \in (I \setminus I_1)$, let $C(\alpha_1, i)$ denote the set of
all $j \in (\{1, \ldots, m\} \setminus \alpha_1(I_1))$ which satisfy both of the following conditions: (1) $\|x_i\| = \|y_j\|$,
and (2) for all $i_1 \in I_1$, $\|x_{i_1} - x_i\| = \|y_{\alpha_1(i_1)} - y_j\|$. $C(\alpha_1, i)$ can be thought of as the set of all
valid candidate assignments for i as an extension of valid assignment $\alpha_1$. This provides pruning
in the search through the assignment space over I as validated by the following theorem (whose
proof is straightforward).

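Theorem 4.1: Given $I_1 \subset I$ and $i \in (I \setminus I_1)$, let $\alpha_1$ and $\hat{\alpha}_1$ be valid assignments on $I_1$ and
$(I_1 \cup \{i\})$, respectively. Further assume that $\hat{\alpha}_1$ extends $\alpha_1$ in the following sense: for all $\ell \in
I_1$, $\hat{\alpha}_1(\ell) = \alpha_1(\ell)$. Then, it is the case that $\hat{\alpha}_1(i) \in C(\alpha_1, i)$.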

Each node in the depth-first search tree represents a valid assignment $\alpha_1$ over some subset, $I_1$, of $I$. The next step in the search extends the assignment by choosing $i \in (I \setminus I_1)$ and choosing an assignment for $i$ from $C(\alpha_1, i)$. 

Algorithm IV-A.2 Determining Unique Validity Main 
Inputs: $I \subseteq \{1, \ldots, a\}$. 
1: Set global variable NumValidAssignFound = 0. 
2: Call Algorithm IV-A.3 on inputs $\emptyset$ and $\alpha_\emptyset$ ($\alpha_\emptyset$ denotes the unique valid assignment on $\emptyset$). 
3: If NumValidAssignFound > 1, then return "I IS NOT UNIQUELY VALID". Else, return "I IS UNIQUELY VALID WITH ASSIGNMENT" $\alpha_I$. 



Comment: The order in which the elements of $(I \setminus I_1)$ and $C(\alpha_1, i)$ are chosen in iterating through the for-loops in Algorithm IV-A.3 does not affect the correctness of the algorithm. However, it may affect efficiency. For simplicity, the loops order the elements in these sets from smallest to largest index number. 

Algorithm IV-A.3 Determining Unique Validity Recursive 
Inputs: $I_1 \subseteq I$ and $\alpha_1$ a valid assignment on $I_1$. 
1: If $I_1 = I$, then 
2:   NumValidAssignFound++ 
3:   If NumValidAssignFound == 1, then set $\alpha_I$ to $\alpha_1$. 
4: End If. 
5: Else, do 
6:   For $i \in (I \setminus I_1)$ and as long as NumValidAssignFound $\le 1$, do 
7:     For $j \in C(\alpha_1, i)$ and as long as NumValidAssignFound $\le 1$, do 
8:       Extend $\alpha_1$ to $\hat{\alpha}_1$ s.t. $\hat{\alpha}_1(i) = j$. Let $\hat{I}_1 = I_1 \cup \{i\}$. 
9:       Call Algorithm IV-A.3 on inputs $\hat{I}_1$ and $\hat{\alpha}_1$. 


Algorithm IV-A.1 has worst-case computational complexity $O(m^a)$. While this is no better than a simple brute-force approach, quite reasonable running times are observed in our experiments because few original data tuples have the same length and/or few pairs of original data tuples have the same Euclidean distance. 
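
The depth-first search with the length-and-distance pruning rule can be sketched in Python as follows. This is only an illustrative sketch, not the authors' Matlab implementation: the function name `find_unique_assignment`, the tolerance `tol`, and the representation of tuples as plain coordinate sequences are assumptions made for the example.

```python
import math

def find_unique_assignment(known, perturbed, tol=1e-9):
    """Return the assignment (one perturbed index per known tuple) if exactly
    one valid assignment exists; otherwise return None.

    known, perturbed: sequences of equal-length coordinate tuples (hypothetical
    representation chosen for this sketch)."""
    origin = [0.0] * len(known[0])
    found = []  # complete valid assignments discovered so far (at most two)

    def candidates(assign, i):
        # Pruning set for known tuple i: unused perturbed indices j whose
        # length and distances to the already assigned tuples match.
        used = set(assign)
        out = []
        for j, y in enumerate(perturbed):
            if j in used:
                continue
            if abs(math.dist(known[i], origin) - math.dist(y, origin)) > tol:
                continue
            if all(abs(math.dist(known[i1], known[i])
                       - math.dist(perturbed[j1], y)) <= tol
                   for i1, j1 in enumerate(assign)):
                out.append(j)
        return out

    def search(assign):
        i = len(assign)  # next unassigned known tuple, smallest index first
        if i == len(known):
            found.append(list(assign))
            return
        for j in candidates(assign, i):
            search(assign + [j])
            if len(found) > 1:  # cut off once a second valid assignment appears
                return

    search([])
    return found[0] if len(found) == 1 else None

# Known inputs and a rotated (by 90 degrees), reordered dataset.
known = [(1.0, 0.0), (0.0, 2.0)]
perturbed = [(-4.0, 3.0), (0.0, 1.0), (-2.0, 0.0)]
assignment = find_unique_assignment(known, perturbed)
```

One design choice in this sketch: each node extends the assignment by a single index (the smallest unassigned one), so every complete assignment is enumerated exactly once.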

B. Known Input-Output Attack 

Assume, without loss of generality, that the attacker applies Algorithm IV-A.1 and learns $\pi_q$ ($0 \le q \le a$), i.e. $\{1, \ldots, q\}$ is the maximal uniquely valid subset of $\{1, \ldots, a\}$. Further, to simplify notation, we may also assume that $\pi_q(i) = i$.$^{9}$ Let $Y_q$ denote the first $q$ columns of $Y$. As such, the attacker is assumed to know $X_q$ and the fact that $Y_q = M_T X_q$, where $M_T$ is an unknown orthogonal matrix. Based on this, she will apply an attack, called the known input-output attack, to produce $q < j \le m$ and $\hat{x}$, which is an estimate of $x_{\hat{j}}$, the private tuple that was perturbed to produce $y_j$. 

The attack is performed in two steps: 1) Using $X_q$ and $Y_q$, the attacker will produce $\hat{M}$, an estimate of $M_T$; 2) Then, for any $q < j \le m$, the attacker can produce the estimate 

$$\hat{x} = \hat{M}' y_j. \quad (1)$$ 

Let $p(x_{\hat{j}}, \epsilon)$ denote $\Pr(\|\hat{x} - x_{\hat{j}}\| \le \|x_{\hat{j}}\|\epsilon)$, the probability that an $\epsilon$-privacy breach will result from the attacker estimating $x_{\hat{j}}$ as $\hat{x}$. We will develop a closed-form expression for $p(x_{\hat{j}}, \epsilon)$. 

This expression will only involve information known to the attacker; therefore, she can choose $q < j \le m$ so as to maximize $p(x_{\hat{j}}, \epsilon)$. 

Since $Y_q = M_T X_q$, the attacker knows that $M_T$ must have been drawn from $\mathbb{M}(X_q, Y_q)$, which is the set of all $M \in \mathbb{O}_n$ such that $M X_q = Y_q$. However, since the attacker has no additional information for further narrowing down this space of possibilities, she will assume each is equally likely to be $M_T$. She will choose $\hat{M}$ uniformly from $\mathbb{M}(X_q, Y_q)$.$^{11}$ In most cases, $\mathbb{M}(X_q, Y_q)$ is uncountable. As such, it is not obvious how to choose $\hat{M}$ uniformly from $\mathbb{M}(X_q, Y_q)$, nor how to compute $p(x_{\hat{j}}, \epsilon) = \Pr(\|\hat{x} - x_{\hat{j}}\| \le \|x_{\hat{j}}\|\epsilon)$. These issues will be discussed in Section IV-D. Before doing so, we discuss some important linear algebra background. 

9 This can be achieved by the attacker appropriately reordering the columns of $X_a$ and $Y$. 

10 If $\hat{M} \approx M_T$, then $\hat{x} \approx M_T' y_j = M_T'(M_T x_{\hat{j}}) = x_{\hat{j}}$, where $x_{\hat{j}}$ was the private tuple perturbed to produce $y_j$. 

C. Linear Algebra background 

Let $Col(X_q)$ denote the column space of $X_q$ and $Col_\perp(X_q)$ denote its orthogonal complement, i.e., $\{z \in \mathbb{R}^n : z'w = 0, \forall w \in Col(X_q)\}$. Likewise, let $Col(Y_q)$ denote the column space of $Y_q$ and $Col_\perp(Y_q)$ denote its orthogonal complement. Let $k$ denote the dimension of $Col(X_q)$. The "Fundamental Theorem of Linear Algebra" [33, pg. 95] implies that the dimension of $Col_\perp(X_q)$ is $n - k$. Since $Y_q = M_T X_q$ and $M_T$ is orthogonal, it can be shown that $Col(Y_q)$ has dimension $k$. Thus, $Col_\perp(Y_q)$ has dimension $n - k$. 

Let $U_k$ and $V_k$ denote $n \times k$ matrices whose columns form an orthonormal basis for $Col(X_q)$ and $Col(Y_q)$, respectively. It can easily be shown that $Col(M_T U_k) = Col(Y_q) = Col(V_k)$. Let $U_{n-k}$ and $V_{n-k}$ denote $n \times (n - k)$ matrices whose columns form an orthonormal basis for $Col_\perp(X_q)$ and $Col_\perp(Y_q)$, respectively. It can easily be shown that $Col(M_T U_{n-k}) = Col_\perp(Y_q) = Col(V_{n-k})$. 

D. A Closed-Form Expression for p(xj,e) 

Now we return to the issue of how to choose $\hat{M}$ uniformly from $\mathbb{M}(X_q, Y_q)$ and how to compute $p(x_{\hat{j}}, \epsilon) = \Pr(\|\hat{x} - x_{\hat{j}}\| \le \|x_{\hat{j}}\|\epsilon) = \Pr(\|\hat{M}' M_T x_{\hat{j}} - x_{\hat{j}}\| \le \|x_{\hat{j}}\|\epsilon)$. 

To choose $\hat{M}$ uniformly from $\mathbb{M}(X_q, Y_q)$, the basic idea is to utilize standard algorithms for choosing a matrix $P$ uniformly from $\mathbb{O}_{n-k}$, then apply an appropriately designed transformation to $P$. The transformation will be an affine bijection from $\mathbb{O}_{n-k}$ to $\mathbb{M}(X_q, Y_q)$. The following technical result, proven in the Appendix, provides this transformation. 

11 This uniform choice of $\hat{M}$ is equivalent to a maximum likelihood estimate of $x_{\hat{j}}$ for any $q < j \le m$. 

Theorem 4.2: Let $L$ be the mapping $P \in \mathbb{O}_{n-k} \mapsto M_T U_k U_k' + V_{n-k} P U_{n-k}'$. Then, $L$ is an affine bijection from $\mathbb{O}_{n-k}$ to $\mathbb{M}(X_q, Y_q)$. And, $L^{-1}$ is the mapping $M \in \mathbb{M}(X_q, Y_q) \mapsto V_{n-k}' M U_{n-k}$. 



Algorithm IV-D.1 Uniform Choice From $\mathbb{M}(X_q, Y_q)$ 
Inputs: $U_k$, an $n \times k$ matrix whose columns form an orthonormal basis of $Col(X_q)$, and $M_T U_k$ ($M_T$ is unknown); $U_{n-k}$ and $V_{n-k}$, $n \times (n - k)$ matrices whose columns form an orthonormal basis of $Col_\perp(X_q)$ and $Col_\perp(Y_q)$, respectively. 
Outputs: $\hat{M}$, a uniformly chosen matrix from $\mathbb{M}(X_q, Y_q)$. 
1: Choose $P$ uniformly from $\mathbb{O}_{n-k}$ using algorithm [34]. 
2: Set $\hat{M} = L(P)$, i.e., $M_T U_k U_k' + V_{n-k} P U_{n-k}'$. 



Some special cases are interesting to highlight: when $k = n$, $\hat{M}$ is chosen as $M_T$; when $k = n - 1$, $\hat{M}$ is one of two choices (one of which equals $M_T$); otherwise, $\hat{M}$ is, in theory, chosen from an uncountable set (containing $M_T$). 
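
A minimal pure-Python sketch of this uniform choice is given below. It rests on several assumptions that are not from the paper: the known inputs are linearly independent, matrices are stored as lists of column vectors, and Gram-Schmidt orthonormalization of Gaussian vectors stands in for the algorithm of [34] for drawing $P$. Because Gram-Schmidt coefficients depend only on inner products, which $M_T$ preserves, orthonormalizing the columns of $Y_q$ in the same order as those of $X_q$ yields $M_T U_k$ without knowing $M_T$. The helper names `orthonormalize` and `uniform_choice` are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors, against=(), tol=1e-9):
    """Modified Gram-Schmidt: orthonormalize `vectors` against the orthonormal
    list `against` and each other, dropping (near-)dependent vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in list(against) + basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def uniform_choice(Xq, Yq, rng=random):
    """Xq, Yq: lists of columns (known inputs and their perturbed versions),
    assumed linearly independent. Returns the estimate as a list of rows."""
    n = len(Xq[0])
    Uk = orthonormalize(Xq)    # orthonormal basis of Col(Xq)
    MUk = orthonormalize(Yq)   # equals M_T Uk (same Gram-Schmidt coefficients)
    std = [[1.0 if r == c else 0.0 for r in range(n)] for c in range(n)]
    Uc = orthonormalize(std, against=Uk)   # basis of the complement of Col(Xq)
    Vc = orthonormalize(std, against=MUk)  # basis of the complement of Col(Yq)
    d = n - len(Uk)
    # Columns of a random orthogonal d x d matrix P.
    P = orthonormalize([[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)])
    M = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            s = sum(v[r] * u[c] for v, u in zip(MUk, Uk))      # M_T Uk Uk'
            s += sum(Vc[a][r] * P[b][a] * Uc[b][c]             # V P U'
                     for a in range(d) for b in range(d))
            M[r][c] = s
    return M

# One known input x and its image y under a 90-degree rotation about the z-axis.
x = [1.0, 2.0, 3.0]
y = [-2.0, 1.0, 3.0]
M_hat = uniform_choice([x], [y], random.Random(0))
```

By construction the returned matrix is orthogonal and maps each known input to its perturbed counterpart; how it acts on the orthogonal complement of $Col(X_q)$ is random.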

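Now we develop a closed-form expression for $p(x_{\hat{j}}, \epsilon)$. The key points are outlined, while a more rigorous justification is provided in the Appendix. First of all, from Algorithm IV-D.1, $\hat{M} = M_T U_k U_k' + V_{n-k} P U_{n-k}'$ where $P$ is chosen uniformly from $\mathbb{O}_{n-k}$. Therefore, 

$$\begin{aligned}
p(x_{\hat{j}}, \epsilon) &= \Pr(\|\hat{M}' M_T x_{\hat{j}} - x_{\hat{j}}\| \le \|x_{\hat{j}}\|\epsilon) \\
&= \Pr(\|U_k U_k' x_{\hat{j}} + U_{n-k} P' V_{n-k}' M_T x_{\hat{j}} - x_{\hat{j}}\| \le \|x_{\hat{j}}\|\epsilon).
\end{aligned}$$ 

Since $\begin{bmatrix} U_k' \\ U_{n-k}' \end{bmatrix} \in \mathbb{O}_n$, it can left-multiply each term on the left-hand side of the inequality in the second probability without changing the equality. As a result, the derivation continues 

$$\begin{aligned}
\cdots &= \Pr\left(\left\| \begin{bmatrix} U_k' x_{\hat{j}} \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ P' V_{n-k}' M_T x_{\hat{j}} \end{bmatrix} - \begin{bmatrix} U_k' x_{\hat{j}} \\ U_{n-k}' x_{\hat{j}} \end{bmatrix} \right\| \le \|x_{\hat{j}}\|\epsilon\right) \\
&= \Pr(\|P' V_{n-k}' M_T x_{\hat{j}} - U_{n-k}' x_{\hat{j}}\| \le \|x_{\hat{j}}\|\epsilon).
\end{aligned}$$ 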

Since $Col(M_T U_{n-k}) = Col(V_{n-k})$, there exists an $(n - k) \times (n - k)$ matrix $B$ such that $M_T U_{n-k} B = V_{n-k}$. It follows that (i) $V_{n-k}' = B' U_{n-k}' M_T'$, (ii) $B = U_{n-k}' M_T' V_{n-k}$. Thus, $B$ is orthogonal.$^{14}$ Using (i), the derivation continues 

$$\begin{aligned}
\cdots &= \Pr(\|P' B' (U_{n-k}' x_{\hat{j}}) - (U_{n-k}' x_{\hat{j}})\| \le \|x_{\hat{j}}\|\epsilon) \quad (2) \\
&= \Pr(\|P' (U_{n-k}' x_{\hat{j}}) - (U_{n-k}' x_{\hat{j}})\| \le \|x_{\hat{j}}\|\epsilon) \quad (3)
\end{aligned}$$ 

where the second equality is due to the fact that $B' \in \mathbb{O}_{n-k}$, and thus $(P'B')$ can be regarded as having been uniformly chosen from $\mathbb{O}_{n-k}$ just like $P'$ (a rigorous proof of the second equality is provided in the Appendix). Putting the whole derivation together, 

$$p(x_{\hat{j}}, \epsilon) = \Pr(P \text{ uniformly chosen from } \mathbb{O}_{n-k} \text{ satisfies } \|P'(U_{n-k}' x_{\hat{j}}) - (U_{n-k}' x_{\hat{j}})\| \le \|x_{\hat{j}}\|\epsilon). \quad (4)$$ 

12 That the resulting $\hat{M}$ was chosen uniformly from $\mathbb{M}(X_q, Y_q)$ could be more rigorously justified using left-invariance of probability measures and the Haar probability measure over $\mathbb{O}_{n-k}$. But, such a discussion is not relevant to this paper and is omitted. 

13 We define $\mathbb{O}_0$ to contain a single, empty matrix. And, for $P \in \mathbb{O}_0$, we define $V_{n-k} P U_{n-k}'$ to be the $n \times n$ zero matrix. 

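Let $S_{n-k}(\|U_{n-k}' x_{\hat{j}}\|)$ denote the hyper-sphere in $\mathbb{R}^{n-k}$ with radius $\|U_{n-k}' x_{\hat{j}}\|$ and centered at the origin. Since $P$ is chosen uniformly from $\mathbb{O}_{n-k}$, any point on the surface of $S_{n-k}(\|U_{n-k}' x_{\hat{j}}\|)$ is equally likely to be $P'(U_{n-k}' x_{\hat{j}})$. Let $S_{n-k}(U_{n-k}' x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)$ denote the "hyper-sphere cap" consisting of all points in $S_{n-k}(\|U_{n-k}' x_{\hat{j}}\|)$ with distance from $U_{n-k}' x_{\hat{j}}$ no greater than $\|x_{\hat{j}}\|\epsilon$. Therefore, (4) becomes 

$$\begin{aligned}
p(x_{\hat{j}}, \epsilon) &= \Pr(\text{a uniformly chosen point on } S_{n-k}(\|U_{n-k}' x_{\hat{j}}\|) \text{ is also in } S_{n-k}(U_{n-k}' x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)) \\
&= \frac{SA(S_{n-k}(U_{n-k}' x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon))}{SA(S_{n-k}(\|U_{n-k}' x_{\hat{j}}\|))} \quad (5)
\end{aligned}$$ 

where $SA(.)$ denotes the surface area of a subset of a hyper-sphere.$^{15}$ Based on equation (5), we prove, in the Appendix, the following closed-form expression for $p(x_{\hat{j}}, \epsilon)$, where $\Gamma(.)$ denotes the standard gamma function, $\mathrm{ac}_{[\,]-1}(x)$ denotes $\arccos\left(\left[\frac{\|x_{\hat{j}}\|\epsilon}{\|U_{n-k}' x_{\hat{j}}\|\sqrt{2}}\right]^2 - 1\right)$, and $\mathrm{ac}_{1-[\,]}(x)$ denotes $\arccos\left(1 - \left[\frac{\|x_{\hat{j}}\|\epsilon}{\|U_{n-k}' x_{\hat{j}}\|\sqrt{2}}\right]^2\right)$: 

$$p(x_{\hat{j}}, \epsilon) =
\begin{cases}
1 & \text{if } n - k = 0; \\
1 & \text{if } \|x_{\hat{j}}\|\epsilon \ge \|U_{n-k}' x_{\hat{j}}\| 2 \text{ and } n - k \ge 1; \\
0.5 & \text{if } \|x_{\hat{j}}\|\epsilon < \|U_{n-k}' x_{\hat{j}}\| 2 \text{ and } n - k = 1; \\
1 - (1/\pi)\,\mathrm{ac}_{[\,]-1}(x) & \text{if } \|U_{n-k}' x_{\hat{j}}\|\sqrt{2} < \|x_{\hat{j}}\|\epsilon < \|U_{n-k}' x_{\hat{j}}\| 2 \text{ and } n - k = 2; \\
1 - \frac{(n-k-1)\Gamma([n-k+2]/2)}{(n-k)\sqrt{\pi}\,\Gamma([n-k+1]/2)} \int_{\theta_1 = 0}^{\mathrm{ac}_{[\,]-1}(x)} \sin^{n-k-1}(\theta_1)\, d\theta_1 & \text{if } \|U_{n-k}' x_{\hat{j}}\|\sqrt{2} < \|x_{\hat{j}}\|\epsilon < \|U_{n-k}' x_{\hat{j}}\| 2 \text{ and } n - k \ge 3; \\
(1/\pi)\,\mathrm{ac}_{1-[\,]}(x) & \text{if } \|x_{\hat{j}}\|\epsilon \le \|U_{n-k}' x_{\hat{j}}\|\sqrt{2} \text{ and } n - k = 2; \\
\frac{(n-k-1)\Gamma([n-k+2]/2)}{(n-k)\sqrt{\pi}\,\Gamma([n-k+1]/2)} \int_{\theta_1 = 0}^{\mathrm{ac}_{1-[\,]}(x)} \sin^{n-k-1}(\theta_1)\, d\theta_1 & \text{if } \|x_{\hat{j}}\|\epsilon \le \|U_{n-k}' x_{\hat{j}}\|\sqrt{2} \text{ and } n - k \ge 3.
\end{cases} \quad (6)$$ 

14 $B'B = B' U_{n-k}' M_T' V_{n-k} = V_{n-k}' V_{n-k} = I_{n-k}$. 

15 $S_1(\|U_1' x_{\hat{j}}\|)$ consists of two points. We define $\frac{SA(S_1(U_1' x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon))}{SA(S_1(\|U_1' x_{\hat{j}}\|))}$ as 0.5 if $S_1(U_1' x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)$ is one point, and as 1 otherwise. Moreover, we define $\frac{SA(S_0(U_0' x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon))}{SA(S_0(\|U_0' x_{\hat{j}}\|))}$ as 1. 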



Comment: it can be shown that $\|U_{n-k}' x_{\hat{j}}\|$ is the distance from $x_{\hat{j}}$ to its closest point in $Col(X_q)$. Thus, the sensitivity of a tuple to breach depends on its length relative to its distance to the column space of $X_q$. 

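Recall that the attacker seeks to use the closed-form expression for $p(x_{\hat{j}}, \epsilon)$ to decide for which $q < j \le m$ the estimate $\hat{x} = \hat{M}' y_j$ is the best estimate of $x_{\hat{j}}$. This is naturally done by choosing $j$ to maximize $p(x_{\hat{j}}, \epsilon)$. To allow for this, observe that $\|x_{\hat{j}}\|\epsilon$ and $\|U_{n-k}' x_{\hat{j}}\|$ equal $\|y_j\|\epsilon$ and $\|V_{n-k}' y_j\|$, respectively,$^{16}$ which are known to the attacker. Therefore, (6) can be rewritten as follows, where $\mathrm{ac}_{[\,]-1}(y)$ denotes $\arccos\left(\left[\frac{\|y_j\|\epsilon}{\|V_{n-k}' y_j\|\sqrt{2}}\right]^2 - 1\right)$, and $\mathrm{ac}_{1-[\,]}(y)$ denotes $\arccos\left(1 - \left[\frac{\|y_j\|\epsilon}{\|V_{n-k}' y_j\|\sqrt{2}}\right]^2\right)$: 

$$p(x_{\hat{j}}, \epsilon) =
\begin{cases}
1 & \text{if } n - k = 0; \\
1 & \text{if } \|y_j\|\epsilon \ge \|V_{n-k}' y_j\| 2 \text{ and } n - k \ge 1; \\
0.5 & \text{if } \|y_j\|\epsilon < \|V_{n-k}' y_j\| 2 \text{ and } n - k = 1; \\
1 - (1/\pi)\,\mathrm{ac}_{[\,]-1}(y) & \text{if } \|V_{n-k}' y_j\|\sqrt{2} < \|y_j\|\epsilon < \|V_{n-k}' y_j\| 2 \text{ and } n - k = 2; \\
1 - \frac{(n-k-1)\Gamma([n-k+2]/2)}{(n-k)\sqrt{\pi}\,\Gamma([n-k+1]/2)} \int_{\theta_1 = 0}^{\mathrm{ac}_{[\,]-1}(y)} \sin^{n-k-1}(\theta_1)\, d\theta_1 & \text{if } \|V_{n-k}' y_j\|\sqrt{2} < \|y_j\|\epsilon < \|V_{n-k}' y_j\| 2 \text{ and } n - k \ge 3; \\
(1/\pi)\,\mathrm{ac}_{1-[\,]}(y) & \text{if } \|y_j\|\epsilon \le \|V_{n-k}' y_j\|\sqrt{2} \text{ and } n - k = 2; \\
\frac{(n-k-1)\Gamma([n-k+2]/2)}{(n-k)\sqrt{\pi}\,\Gamma([n-k+1]/2)} \int_{\theta_1 = 0}^{\mathrm{ac}_{1-[\,]}(y)} \sin^{n-k-1}(\theta_1)\, d\theta_1 & \text{if } \|y_j\|\epsilon \le \|V_{n-k}' y_j\|\sqrt{2} \text{ and } n - k \ge 3.
\end{cases} \quad (7)$$ 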



To spell out the attack algorithm, first note that $U_k$, $V_k$, $U_{n-k}$, and $V_{n-k}$ can be computed from $X_q$ and $Y_q$ using standard procedures [33]. Second, $M_T U_k = Y_q A$ where $A$ is a $q \times k$ matrix that can be computed$^{17}$ from $U_k$ and $X_q$. Third, a recursive procedure for computing (7) is described in the Appendix. The precise details of the attack technique can be seen in Algorithm IV-D.2. The $\epsilon$-privacy breach probability $p(\epsilon)$ equals $\max_{q < j \le m} p(x_{\hat{j}}, \epsilon)$. 

16 $M_T x_{\hat{j}} = y_j$, so $\|x_{\hat{j}}\| = \|M_T x_{\hat{j}}\| = \|y_j\|$. Moreover, as shown earlier, there exists $B \in \mathbb{O}_{n-k}$ such that $V_{n-k}' = B' U_{n-k}' M_T'$. Thus, $\|U_{n-k}' x_{\hat{j}}\| = \|B' U_{n-k}' M_T' M_T x_{\hat{j}}\| = \|V_{n-k}' y_j\|$. 



Algorithm IV-D.2 Known Input Attack Algorithm 
Inputs: $Y$, $\epsilon > 0$, and $X_q$. The attacker knows $Y_q = M_T X_q$ ($M_T$ is unknown). 
Outputs: $q < j \le m$ and $\hat{x} \in \mathbb{R}^n$, the corresponding estimate of $x_{\hat{j}}$. 
1: Compute $U_k$, $V_k$, $U_{n-k}$, $V_{n-k}$, and $M_T U_k$ as described earlier. 
2: For each $q < j \le m$ do 
3:   Compute $p(x_{\hat{j}}, \epsilon)$ using (7) as described in the Appendix. 
4: End For. 
5: Choose the $j$ from the previous loop producing the largest $p(x_{\hat{j}}, \epsilon)$. 
6: Choose $\hat{M}$ uniformly from $\mathbb{M}(X_q, Y_q)$ by applying Algorithm IV-D.1. 
7: Set $\hat{x} \leftarrow \hat{M}' y_j$. 
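
As an illustration of step 3, the low-dimensional cases of the breach probability ($n - k \le 2$) can be evaluated directly. The sketch below is an assumption-laden simplification, not the recursive procedure referred to in the paper: the function name `breach_probability` and its argument names are hypothetical, the integral cases ($n - k \ge 3$) are deliberately left out, and the two $n - k = 2$ cases are merged using the identity $1 - (1/\pi)\arccos(t - 1) = (1/\pi)\arccos(1 - t)$.

```python
import math

def breach_probability(y_norm, v_norm, eps, codim):
    """Breach probability for codim = n - k in {0, 1, 2}.

    y_norm: length of the perturbed tuple y_j.
    v_norm: length of the projection of y_j onto the orthogonal complement
            of Col(Y_q).
    eps:    relative error bound."""
    if codim == 0:
        return 1.0                      # the estimate is exact
    d = y_norm * eps                    # allowed absolute error
    if d >= 2 * v_norm:
        return 1.0                      # the cap covers the whole sphere
    if codim == 1:
        return 0.5                      # the "sphere" is two points
    if codim == 2:
        # Fraction of a circle within chord distance d of a fixed point.
        t = (d / (v_norm * math.sqrt(2))) ** 2
        return math.acos(1 - t) / math.pi
    raise NotImplementedError("n - k >= 3 requires the integral cases")

# On a unit circle, a chord of length 1 subtends 60 degrees on either side.
p = breach_probability(1.0, 1.0, 1.0, 2)
```

For the attack itself, this value would be computed for every candidate $j$ and the largest one kept.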



E. Experiments 

The experiments are designed to assess the computational efficiency of the overall known input attack and its effectiveness at breaching privacy. We used two datasets as the input $X$: 1) a 100,000 tuple synthetic dataset generated from a 100-variate Gaussian distribution;$^{18}$ 2) the Letter Recognition dataset, with 20,000 tuples and 16 numeric attributes, from the UCI machine learning repository. From the latter we removed tuples which were duplicated over the numeric attributes, yielding a final dataset of 18,668 tuples. The attacks were implemented in Matlab 7 (R14) and all experiments were carried out on a Thinkpad laptop with a 1.83GHz Intel Core 2 CPU, 1.99GB RAM, and Windows XP. 

The first experiment fixes $X$ and its perturbed version $Y$, but changes the number of known input tuples, $a$. It proceeds by carrying out ten iterations as follows. Select $a$ linearly independent tuples randomly from $X$ (these become the known inputs). Use Algorithm IV-A.1 to compute $I$, the maximal uniquely valid assignment. Use steps 2-5 in Algorithm IV-D.2 to compute $p(\epsilon)$, the $\epsilon$-privacy breach probability (a closed form was given immediately above Algorithm IV-D.2). 

17 Since $Col(U_k) = Col(X_q)$, then by solving $k$ systems of linear equations (one for each column of $U_k$), a $q \times k$ matrix $A$ can be computed such that $X_q A = U_k$. 

18 The mean vector is specified by independently generating 100 numbers from a univariate Gaussian with mean zero and variance one. The covariance matrix is specified by (i) independently generating 100 data tuples, each with 100 entries independently generated from a univariate Gaussian with mean zero and variance one, and (ii) computing the empirical covariance of this 100 tuple dataset. 

To measure the accuracy of the attack, we report the average of $p(\epsilon)$ and $|I|$ over all iterations. To measure the efficiency, we report the average time taken to compute $I$ (the rest is ignored as the overall attack computation time is dominated by Algorithm IV-A.1). In Figures 1 and 2, results are shown with $\epsilon = 0.15$. In Figure 3, accuracy results are shown with varying $\epsilon$ and $a$ fixed at four. In all figures, the error bars show one standard deviation above and below the average. 

Fig. 1. Known input attack on Gaussian data with different numbers of known inputs and $\epsilon = 0.15$. 

Fig. 2. Known input attack on Letter Recognition data with different numbers of known inputs and $\epsilon = 0.15$. 



The second experiment fixes the number of known input tuples (and $\epsilon$ at 0.15) but changes the size of the original data $X$ in order to assess the computational efficiency of the attack. For the Gaussian data, it uses the first $k$ tuples as $X$ where $k$ takes a value in {10000, 20000, . . . , 100000}. Then the attack proceeds by carrying out the following operations ten times. Select $a = 50$ linearly independent tuples randomly from $X$ and use Algorithm IV-A.1 to compute the maximal uniquely valid assignment $I$. The average time taken to compute $I$ is given in Figure 4 left. For the Letter Recognition data, $k$ takes a value in {2000, 4000, . . . , 18000} and the attack randomly selects $a = 10$ linearly independent tuples as the known inputs. The average time taken to find $I$ is given in Figure 4 right. 

Fig. 3. Known input attack on Letter Recognition data with fixed data size, fixed number of known inputs $a = 4$, but varying $\epsilon$. 

Fig. 4. Known input attack on Gaussian (left) and Letter Recognition data (right) with varying size, but fixed number of known inputs $a = 50, 10$ (respectively) and fixed $\epsilon = 0.15$. 

Regarding the known input attack accuracy, the linking phase of the attack (Algorithm IV-A.1) exhibits excellent performance. For synthetic data, its performance is perfect in that all known input tuples have their corresponding perturbed tuple inferred (see Figure 1 left). For real data, its performance is nearly perfect (see Figure 2 left). As expected, $p(\epsilon)$ approaches one as $a$ increases (see Figures 1 and 2 right). Interestingly, on the synthetic dataset, the transition of $p(\epsilon)$ from 0 to 1 occurs very sharply around $a = 60$. Moreover, on the real dataset, $p(\epsilon) = 1$ with $a$ as small as 4 (and we also observe in Figure 3 that the probability remains fairly high for $\epsilon$ as small as 0.07). 

Regarding computational efficiency, the algorithm appears to require quite reasonable time in all cases observed, e.g. less than 450 seconds on the synthetic dataset with 100 known tuples (see Figure 1 center) and less than 45 seconds on the real dataset with 16 known inputs (see Figure 2 center). With respect to known input set size ($a$), the average computation time exhibits a linear (synthetic data) or slower (real data) trend (see Figure 1 center and Figure 2 center). With respect to dataset size (number of private data tuples), the average computation time exhibits a clear linear trend for both synthetic and real data (see Figure 4 left and right). These results demonstrate that, despite the high worst-case computational complexity, the computation times on both real and synthetic data are quite reasonable. 

The experimental results support the conclusion that the attack can breach privacy in plausible 
situations. For example, on the 16-dimensional, 18688 tuple real dataset, the known input attack 
achieves a privacy breach with probability one using four known inputs and less than 30 seconds 
of run-time. 



F. Analysis Over MED and Cos Privacy Breach Definitions 

$\epsilon$-MED-privacy breach: It is shown in the Appendix that $\min_{p=1}^{n}\{NAD(x_{\hat{j},p}, \hat{x}_p)\} \le \frac{\|\hat{x} - x_{\hat{j}}\|}{\|x_{\hat{j}}\|}$. Hence, if an $\epsilon$-privacy breach occurs, then so does an $\epsilon$-MED-privacy breach. Therefore, the analysis of the known input attack can be used to lower-bound the probability that an $\epsilon$-MED-privacy breach occurred. As a result, our experiments show that on the Letter Recognition data, four known inputs produce an $\epsilon$-MED-privacy breach with probability one. Unfortunately, the lower bound is not tight, as examples can be found making the relative Euclidean distance arbitrarily larger than the minimum MED distance. 

$\epsilon$-cos-privacy breach: It is shown in the Appendix that $1 - \cos(x_{\hat{j}}, \hat{x}) = \frac{\|\hat{x} - x_{\hat{j}}\|^2}{2\|x_{\hat{j}}\|^2}$. Therefore, an $\epsilon$-cos-privacy breach occurred if and only if a $(\sqrt{2\epsilon})$-privacy breach occurred. Therefore, the analysis of the known input attack can be easily modified to produce a closed-form expression for the probability that an $\epsilon$-cos-privacy breach occurred. 
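
The relation between the cosine and relative-distance breach definitions can be checked numerically. The sketch below assumes, as holds for the known input attack, that the estimate has the same length as the private tuple (both are orthogonal images of the same perturbed tuple); the function names and the sample vector are made up for the illustration.

```python
import math

def relative_distance(x, x_hat):
    """Relative Euclidean error of the estimate x_hat of x."""
    return math.dist(x, x_hat) / math.hypot(*x)

def one_minus_cos(x, x_hat):
    """One minus the cosine of the angle between x and x_hat."""
    inner = sum(a * b for a, b in zip(x, x_hat))
    return 1 - inner / (math.hypot(*x) * math.hypot(*x_hat))

# An estimate of equal length: x rotated by 0.3 radians.
theta = 0.3
x = [3.0, 4.0]
x_hat = [x[0] * math.cos(theta) - x[1] * math.sin(theta),
         x[0] * math.sin(theta) + x[1] * math.cos(theta)]

eps = 0.05
cos_breach = one_minus_cos(x, x_hat) <= eps
dist_breach = relative_distance(x, x_hat) <= math.sqrt(2 * eps)
```

Both breach tests agree here, and $1 - \cos$ equals half the squared relative distance up to rounding.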

V. Known Sample Attack 

In this scenario, we assume that each data column of $X$ arose as an independent sample from a random vector $V$ with unknown p.d.f. We also make the following mild technical assumption: the covariance matrix $\Sigma_V$ of $V$ has all distinct eigenvalues. Furthermore, we assume that the attacker has a collection of $q$ samples that arose independently from $V$; these are denoted as the columns of matrix $S$. It is important to stress that the columns of $S$ are not assumed to be samples from the private data $X$; rather, they are samples drawn from $V$ independently of $X$. Using these assumptions, we will design a Principal Component Analysis (PCA)-based attack technique and analyze its privacy breach probability through experiments. 

Attack Intuition: The basic procedure is to estimate $M_T$ and use this estimate to undo the data perturbation applied to $X$. The key idea in estimating $M_T$ is that the principal components of $\Sigma_{(M_T V + v_T)}$ equal the perturbation by $M_T$ of the principal components of $\Sigma_V$ up to a mirror flip about each component. Since these covariance matrices can be estimated from $Y$ and $S$, respectively, so can the corresponding principal components. The equality above then allows $M_T$ to be estimated up to mirror flips. To choose the right mirror flip, an equality of distributions test is applied using $S$ and $Y$. 



A. PCA Preliminaries and a Key Property 

Because $\Sigma_V$ is an $n \times n$, symmetric matrix with all distinct eigenvalues, it has $n$ real eigenvalues $\lambda_1 > \ldots > \lambda_n$ and their associated eigenspaces, $\{z \in \mathbb{R}^n : \Sigma_V z = z\lambda_i\}$, are pair-wise orthogonal with dimension one [33, pg. 295]. As is standard practice, we restrict our attention to only a small number of eigenvectors. Let $Z(V)_i$ denote the set of all vectors $z \in \mathbb{R}^n$ such that $\Sigma_V z = z\lambda_i$ and $\|z\| = 1$. We call this the normalized eigenspace of $\lambda_i$. The normalized eigenspaces of $\Sigma_V$ are related in a natural way to those of $\Sigma_{(M_T V + v_T)}$,$^{19}$ as shown by the following theorem (proven in the Appendix). This theorem is important as it will provide the foundation for a technique by which the attacker can estimate $M_T$ from $Y$ and $S$. 

Theorem 5.1: The eigenvalues of $\Sigma_V$ and $\Sigma_{(M_T V + v_T)}$ are the same. Let $Z(M_T V + v_T)_i$ denote the normalized eigenspace of $\Sigma_{(M_T V + v_T)}$ associated with $\lambda_i$, i.e. the set of vectors $w \in \mathbb{R}^n$ such that $\Sigma_{(M_T V + v_T)} w = w\lambda_i$ and $\|w\| = 1$. It follows that $M_T Z(V)_i = Z(M_T V + v_T)_i$, where $M_T Z(V)_i$ equals $\{M_T z : z \in Z(V)_i\}$. 

Because all the eigenspaces of $\Sigma_V$ have dimension one, it can be shown that each normalized eigenspace, $Z(V)_i$, contains only two vectors and these differ only by a factor of $-1$. Thus, letting $z_i$ denote the lexicographically larger vector, $Z(V)_i$ can be written as $\{z_i, -z_i\}$. Let $Z$ denote the $n \times n$ eigenvector matrix whose $i$-th column is $z_i$. Because the eigenspaces of $\Sigma_V$ are pairwise orthogonal and $\|z_i\| = 1$, $Z$ is orthogonal. Similarly, $Z(M_T V + v_T)_i$ can be written as $\{w_i, -w_i\}$ ($w_i$ is the lexicographically larger among $w_i$, $-w_i$) and $W$ is the eigenvector matrix with $i$-th column $w_i$ ($W$ is orthogonal). Note again that the columns in both $Z$ and $W$ are ordered such that the $i$-th eigenvector is associated with the $i$-th eigenvalue. The following result, proven in the Appendix, forms the basis of the attack algorithm. 

19 $\Sigma_{(M_T V + v_T)}$ denotes the covariance matrix of random vector $M_T V + v_T$. 

Corollary 5.2: Let $I_n$ be the space of all $n \times n$ matrices with each diagonal entry $\pm 1$ and each off-diagonal entry 0 ($2^n$ matrices in total). There exists $D_0 \in I_n$ such that $M_T = W D_0 Z'$. 

B. Known Sample Attack (PCA Attack) Algorithm on Orthogonal Data Perturbation 

Like Section IV, we first develop the attack technique in the case where the data perturbation is assumed to be orthogonal (does not involve a fixed translation, $v_T = 0$). Then, in Section V-E, we discuss how the attack technique can be extended to arbitrary Euclidean distance preserving perturbation ($v_T \ne 0$). 

First assume that the attacker knows the covariance matrices $\Sigma_V$ and $\Sigma_{M_T V}$ and, thus, computes $W$, the eigenvector matrix of $\Sigma_{M_T V}$, and $Z$, the eigenvector matrix of $\Sigma_V$. By Corollary 5.2, the attacker can perfectly recover $M_T$ if she can choose the right $D$ from $I_n$. To do so, the attacker utilizes $S$ and $Y$, in particular, the fact that these arose as independent samples from $V$ and $M_T V$, respectively. For any $D \in I_n$, if $D = D_0$, then $WDZ'S$ and $Y$ have both arisen as independent samples from $M_T V$. The attacker will choose $D \in I_n$ such that $WDZ'S$ is most likely to have arisen as an independent sample from the same random vector as $Y$. To make this choice, the attacker can use a multi-variate two-sample hypothesis test for equal distributions [35]. Let $p(WDZ'S, Y)$ denote the resulting p-value. The smaller the p-value, the more convincingly the null hypothesis (that $WDZ'S$ and $Y$ have arisen as independent samples from identically distributed random vectors) can be rejected. Therefore, $D \in I_n$ is chosen to maximize $p(WDZ'S, Y)$. 

Finally, the attacker can eliminate the assumption at the start of the previous paragraph by replacing $\Sigma_V$ and $\Sigma_{M_T V}$ with estimates computed from $S$ and $Y$. In experiments we use the standard sample covariance matrices $\Sigma_S$ and $\Sigma_Y$. Algorithm V-B.1 shows the complete PCA-based attack procedure. 

Since the two-sample test requires $O((m + q)^2)$ computation for $p(., .)$, the overall computation cost of Algorithm V-B.1 is $O(2^n (m + q)^2)$. 
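
The factor $2^n$ in this cost comes from trying every matrix in $I_n$. A small sketch of that enumeration follows; the `score` callable is a hypothetical stand-in for the p-value $p(WDZ'S, Y)$ of the two-sample test (which is not implemented here), and each matrix in $I_n$ is represented only by its diagonal.

```python
from itertools import product

def sign_matrices(n):
    """Yield the diagonals of all 2^n matrices in I_n as tuples of +1 / -1."""
    return product((1, -1), repeat=n)

def choose_flip(n, score):
    """Return the diagonal D that maximizes score(D).

    score: callable mapping a diagonal (tuple of +1 / -1) to a number; in the
    attack it would be the p-value of the equal-distribution test."""
    return max(sign_matrices(n), key=score)

# Toy score that peaks at a chosen "true" flip pattern (illustration only).
true_flip = (1, -1, 1)
best = choose_flip(3, lambda d: -sum((a - b) ** 2 for a, b in zip(d, true_flip)))
```

Since the number of candidates doubles with every added attribute, this exhaustive selection is only practical for small $n$.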

Take note that the quality of covariance matrix estimation from $S$ and $Y$ impacts the effectiveness of the attack. Clearly, poor quality estimation will result in low attack accuracy. With the exception of sample size, we do not consider other sampling factors (departures from independence, noise, outliers, etc.) that affect the quality of covariance matrix estimation. We feel such issues are orthogonal to this work as any technique for covariance matrix estimation can be used in the attack. For simplicity, we stick with the standard sample covariance matrices. 

Algorithm V-B.1 PCA-based Attack Algorithm on Orthogonal Data Perturbation 
Inputs: $S$, an $n \times q$ matrix where each column arose as an independent sample from $V$, a random vector with unknown p.d.f. whose covariance matrix has all distinct eigenvalues, and such that the columns of $X$ arose as independent samples from $V$. $Y = M_T X$, where $M_T$ is an unknown, $n \times n$, orthogonal matrix. 
Outputs: $1 \le j \le m$ and $\hat{x} \in \mathbb{R}^n$, the corresponding estimate of $x_{\hat{j}}$. 
1: Compute sample covariance matrix $\Sigma_S$ from $S$ and sample covariance matrix $\Sigma_Y$ from $Y$. 
2: Compute the eigenvector matrix $Z$ of $\Sigma_S$ and $W$ of $\Sigma_Y$. Each eigenvector has unit length and is sorted in the matrix by the corresponding eigenvalue. 
3: Choose $D = \mathrm{argmax}\{p(WDZ'S, Y) : D \in I_n\}$, choose $1 \le j \le m$ randomly, and set $\hat{x} \leftarrow$ the $j$-th column in $ZDW'Y$. 

C. Experiments - Orthogonal Data Perturbation 

We conduct experiments on both synthetic and real world data to evaluate the performance of the PCA-based attack on orthogonal data perturbation. We choose the perturbation matrix uniformly from $\mathbb{O}_n$ and keep it fixed for the same private data. Since the choice of $\pi$ does not affect the experiments, we choose the identity permutation throughout. To approximate the probability of privacy breach, we compute the fraction of breaches out of 100 independent runs. In all figures shown in this section, a solid line is added showing a best polynomial fit to the points. This line is generated with Matlab's curve fitting toolbox. The attack was implemented in Matlab 6 (R13) and all experiments were carried out on a Dell dual-processor workstation with 3.00GHz and 2.99GHz Xeon CPUs, 3.00GB RAM, and Windows XP. 

The synthetic dataset contains 10,000 data points, and it is generated from a multi-variate Gaussian distribution with covariance matrix 

$$\begin{pmatrix} 1 & 1.5 & 0.5 \\ 1.5 & 3 & 2.5 \\ 0.5 & 2.5 & 75 \end{pmatrix}.$$ 

The attacker has a sample generated independently from the same distribution. We conduct experiments to examine how sample size affects the quality of the attack. Figure 5 shows that when the relative error bound is fixed, the probability of privacy breach increases as the sample size increases. 

For the real world data, we choose the Letter Recognition Database and Adult Database from the UCI machine learning repository.$^{20}$ The Letter Recognition data has 20,000 tuples and 16 numeric features. We choose the first 6 attributes (excluding the class label) for the experiments. Note that unlike the experiments in Section IV-E, here we do not remove duplicates. The Adult data contains 32,561 tuples, and it is extracted from the census bureau database. We select three numeric attributes: age, education-num and hours-per-week, for the experiments. We randomly separate each dataset into two disjoint sets. One set is viewed as the original data, and the other one is the attacker's sample data. To examine the influence of sample size, we perform the same series of experiments as we do for Gaussian data. Figure 6 gives the results for Letter Recognition data. Figure 7 gives the results for Adult data. 

Fig. 5. PCA-based attack for three-dimensional Gaussian data ($\epsilon = 0.02$); horizontal axis: sample ratio. 

Fig. 6. PCA-based attack for Letter Recognition data; horizontal axis: sample ratio. 

Fig. 7. PCA-based attack for Adult data; horizontal axis: sample ratio. 

From the above experiments, we make the following observations: (1) the larger the sample size, the better the quality of data recovery, and (2) among these three datasets, the PCA-based attack works best for Gaussian data, next for Letter Recognition data, and then for Adult data. The first observation requires no explanation. We will discuss the second one in the next section. 

20 http://mlearn.ics.uci.edu/MLSummary.html 

D. Effectiveness of the Known Sample Attack (PCA Attack) Algorithm on Orthogonal Data Perturbation 

The effectiveness of the PCA Attack algorithm can be hampered by the presence of either of the following two properties of the p.d.f., $f$, of $V$. (1) The eigenvalues of $\Sigma_V$ are nearly identical. (2) For some $D_i \ne D_0 \in I_n$, $f$ is invariant over $D_i$ in the sense that $f_{D_i}$ and $f_{D_0}$ can't be distinguished, where $f_{D_0}$ and $f_{D_i}$ are the p.d.f.s of $W D_0 Z'V$ and $W D_i Z'V$. 

First, suppose the eigenvalues of $\Sigma_V$ are nearly identical. Without loss of generality, we can assume $V$ has a diagonal covariance matrix whose diagonal entries (from top-left to bottom-right) are $d, d - \beta, d - 2\beta, \ldots, d - n\beta$ where $d - n\beta > 0$ and $0 < \beta < 1$ is small. In this case, small errors in estimating $\Sigma_V$ from sample $S$ can produce a different ordering of the eigenvectors and, hence, large errors in the attacker's recovery. As an extreme case, when $V$ is the $n$-variate Gaussian with covariance matrix $I_n \gamma$ for some constant $\gamma$, all the eigenvalues are the same, and there is only one eigenspace, $\mathbb{R}^n$. The PCA attack algorithm will fail. 

Consider the minimum ratio of any pair of eigenvalues, with the larger eigenvalue in the numerator, i.e., $\min\{\lambda_i/\lambda_j : i < j;\ i, j = 1, \ldots, n\}$ (we call this the minimum eigen-ratio). We would expect that the smaller this value, the smaller the attacker's success probability. To examine this hypothesis, we generate a three-dimensional dataset of tuples sampled independently from a Gaussian with mean $(10, 10, 10)$ and covariance matrix 

$$\begin{pmatrix} 0.1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & b \end{pmatrix}.$$ 

By changing the value of $b$ from 2 to 40, we can change the minimum eigen-ratio from 1 to 20. The original data contains 10,000 tuples. We fix the sample ratio to be 2% and the relative error bound $\epsilon = 0.05$. Figure 8 shows that when all other parameters are fixed, the higher the eigen-ratio, the better the performance of the attack algorithm. This explains why, in our previous experiments, the PCA attack works best for Gaussian data, then Letter Recognition data, and then Adult data. A simple computation shows that the minimum eigen-ratios of the Gaussian data, Letter Recognition data and Adult data are 19.6003, 1.3109, and 1.2734, respectively. 

Second, suppose $f$ is invariant over some $D_i \ne D_0 \in I_n$. Then $p(W D_0 Z'S, Y)$ may not be larger than $p(W D_i Z'S, Y)$, and the attack algorithm will fail. We would expect that the closer $f$ is to invariance, the smaller the attacker's success probability. To examine this hypothesis we need a metric for quantifying the degree to which $f$ is invariant. Intuitively, the invariance of $f$ can be quantified as the degree to which $f_{D_i}$ and $f_{D_0}$ are distinguishable (minimized over all $D_i \ne D_0 \in I_n$). To formalize this definition, we use the symmetric Kullback-Leibler divergence $KL(g\|h) + KL(h\|g)$ to measure the difference between two continuous distributions $g$ and $h$. This measurement is symmetric and nonnegative, and when it is equal to zero, the distributions can be regarded as indistinguishable. So, we quantify invariance as 

Fig. 8. PCA-based attack w.r.t. minimum eigen-ratio ($\epsilon = 0.05$ and sample ratio 2%). 

Fig. 9. PCA-based attack w.r.t. $a$ ($\epsilon = 0.05$ and sample ratio 2%). 



$$\mathrm{Inv}(f) = \min_{D_i \neq D_0 \in \mathcal{I}_n} \left\{ KL(f_{D_i}\|f_{D_0}) + KL(f_{D_0}\|f_{D_i}) \right\}. \tag{8}$$

Clearly $\mathrm{Inv}(f) \geq 0$ with equality exactly when $f$ is invariant. The behavior of $\mathrm{Inv}$ in the
general case is quite complicated. However, under certain (fairly strong) assumptions, $\mathrm{Inv}(f)$
can be nicely characterized. In the Appendix we provide derivation details of the following result.
Let $\mu$ be some fixed element of $\mathbb{R}^n$. Assume that $f$ is an $n$-variate Gaussian distribution with
mean vector $\mu_{\mathcal{V}} = a\mu$, for some $a \geq 0$, and invertible covariance matrix $\Sigma_{\mathcal{V}}$. We have

$$\mathrm{Inv}(f) = a^2 \min_{D_i \neq D_0 \in \mathcal{I}_n} \left( \mu' Z (D_i - D_0) \Lambda_{\mathcal{V}}^{-1} (D_i - D_0) Z' \mu \right), \tag{9}$$

where $\Lambda_{\mathcal{V}}$ and $Z$ are the eigenvalue and eigenvector matrices of $\Sigma_{\mathcal{V}}$, respectively. Hence, we see
that $\mathrm{Inv}(f)$ approaches zero quadratically as $a \to 0$.
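
The closed form (9) can be evaluated by brute force over the $2^n$ sign matrices. The sketch below is illustrative only: it assumes NumPy, the helper name `inv_gaussian` is hypothetical, and because $D_0$ depends on the unknown perturbation it simply takes $D_0$ to be the identity unless another sign pattern is supplied.

```python
import itertools
import numpy as np

def inv_gaussian(a, mu, cov, d0=None):
    """Evaluate Eq. (9): a^2 * min over D_i != D_0 of mu' Z (D_i - D_0) Lambda^{-1} (D_i - D_0) Z' mu."""
    lam, Z = np.linalg.eigh(cov)                    # eigenvalues and orthonormal eigenvectors
    n = len(lam)
    # Assumption for illustration: D_0 is the identity unless its diagonal is passed in.
    d0 = np.ones(n) if d0 is None else np.asarray(d0, dtype=float)
    proj = Z.T @ np.asarray(mu, dtype=float)        # Z' mu
    best = np.inf
    for signs in itertools.product((1.0, -1.0), repeat=n):
        d = np.array(signs)
        if np.array_equal(d, d0):
            continue                                # skip D_i = D_0
        diff = (d - d0) * proj                      # (D_i - D_0) Z' mu, since both are diagonal
        best = min(best, float(diff @ (diff / lam)))
    return a ** 2 * best

cov = np.diag([0.1, 2.0, 40.0])
for a in (0.0, 1.0, 2.0):
    print(a, inv_gaussian(a, [1.0, 1.0, 1.0], cov))
```

Doubling $a$ multiplies the value by four, which is the quadratic behavior noted after (9).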

With this result we can carry out experiments to measure the effect of the degree to which $f$
is invariant on the attacker's success probability. We generate a dataset by sampling each tuple
independently from a three-dimensional Gaussian with covariance matrix

$$\begin{pmatrix} 0.1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 40 \end{pmatrix}$$

and mean vector $\mu_{\mathcal{V}} = a(1, 1, 1)'$. Note that the minimum eigen-ratio is 20, sufficiently large to isolate
the effect of decreasing invariance on the attacker's success probability. We change the value of $a$
from 0 to 10. The original dataset contains 10,000 tuples. We fix the sample ratio to be 2% and the
relative error bound to $\epsilon = 0.05$. Figure 9 shows that as the mean approaches zero, the probability
of privacy breach drops to zero too; however, as the mean moves away from zero, the probability
of privacy breach increases very fast.

E. Known Sample Attack (PCA Attack) Algorithm on General Distance-Preserving Data Perturbation

In the previous subsections, we considered the case where the data perturbation is assumed
to be orthogonal (does not involve a fixed translation, $v_T = 0$). Now we consider how the attack
technique can be extended to arbitrary Euclidean distance preserving perturbation ($v_T \neq 0$). The
basic idea is very similar to that of the known input attack described in the Appendix. Since
the same $v_T$ is added to all tuples in the perturbation of $X$, by considering differences we
can transform the situation back to the orthogonal data perturbation case and apply the same
attack technique described above. However, since the PCA attack assumes that the tuples in $X$
arose independently from $\mathcal{V}$, the difference tuples over $Y$ cannot be computed with respect to
a single fixed tuple (the resulting tuples could not be regarded as having arisen independently).
Instead, disjoint pairs of tuples from $Y$ must be used. Further, since the tuples in $S$ are also
assumed to have arisen independently from $\mathcal{V}$, difference tuples must also be formed from
$S$ using disjoint pairs.

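Let $s_1, \ldots, s_q$ denote the sample tuples (columns of $S$). We assume that $q$ and $m$ (the number
of tuples in $Y$) are even; if not, we simply discard a randomly chosen tuple from $Y$ or $S$ or
both. Let $S^*$ denote the $n \times (q/2)$ matrix whose $i$-th column is $s_i^* = s_i - s_{q/2+i}$. Let $Y^*$ denote the
$n \times (m/2)$ matrix whose $i$-th column is $y_i^* = y_i - y_{m/2+i}$. The $s^*$ tuples have arisen independently
from $\mathcal{V} - \mathcal{W}$ where $\mathcal{W}$ is a random vector independent of $\mathcal{V}$ but identically distributed. Moreover,
the covariance matrix of $\mathcal{V} - \mathcal{W}$, $\Sigma_{(\mathcal{V}-\mathcal{W})} = 2\Sigma_{\mathcal{V}}$, has eigenvalues $2\lambda_1 > 2\lambda_2 > \cdots > 2\lambda_n$ (all
distinct). Finally, the $y^*$ tuples have arisen independently from $M_T(\mathcal{V} - \mathcal{W})$. Therefore, the PCA
attack algorithm can be used with $S^*$ and $Y^*$ to produce $\hat{M}$, an estimate of $M_T$. Using this
and the sample means, $\hat{\mu}_{M_T\mathcal{V}+v_T}$ and $\hat{\mu}_{\mathcal{V}}$ (computed from $Y$ and $S$), for any $1 \leq j \leq m$, data
tuple $x_j$ is estimated as

$$\hat{x} = \hat{M}'\left( y_j - \left[\hat{\mu}_{M_T\mathcal{V}+v_T} - \hat{M}\hat{\mu}_{\mathcal{V}}\right] \right). \tag{10}$$

The intuitive rationale for (10): if $\hat{M} \approx M_T$, $\hat{\mu}_{M_T\mathcal{V}+v_T} \approx M_T\mu_{\mathcal{V}} + v_T$, and $\hat{\mu}_{\mathcal{V}} \approx \mu_{\mathcal{V}}$, then it
follows that

$$\hat{x} \approx M_T'\left( y_j - \left[M_T\mu_{\mathcal{V}} + v_T - M_T\mu_{\mathcal{V}}\right] \right) = M_T'(y_j - v_T) = M_T'M_Tx_j = x_j.$$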

F. Effectiveness of the Known Sample Attack (PCA Attack) Algorithm on General Distance-Preserving Data Perturbation

Similar to the discussion in Section IV-D, we focus on the p.d.f., $f^*$, of $\mathcal{V} - \mathcal{W}$, and more
specifically, (1) the difference in the eigenvalues of $\Sigma_{(\mathcal{V}-\mathcal{W})}$ expressed as the minimum eigen-ratio,
and (2) the invariance of $f^*$, $\mathrm{Inv}(f^*)$. It can easily be shown that the minimum eigen-ratio
of $\Sigma_{(\mathcal{V}-\mathcal{W})}$ is the same as that of $\Sigma_{\mathcal{V}}$. Regarding the invariance of $f^*$, if $f$ is multivariate
Gaussian, then the discussion in Section IV-D implies that $\mathrm{Inv}(f^*) = 0$ because the mean of $f^*$
is 0. Hence, the PCA attack algorithm will likely fail in the case where the original data arose
from a multivariate Gaussian distribution and the data perturbation is distance-preserving but
not orthogonal (i.e. $v_T \neq 0$).
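
The disjoint-pair difference construction of subsection E, and the claim that it leaves the minimum eigen-ratio unchanged, can be checked numerically. This is a rough sketch under stated assumptions: NumPy is used, the orthogonal matrix and translation are arbitrary stand-ins for $M_T$ and $v_T$, and the helper names `pair_differences` and `min_eigen_ratio` are hypothetical. It relies on the fact that the difference of two independent, identically distributed vectors has twice the covariance of either one, so all eigenvalues scale by the same factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 20000
cov = np.diag([0.1, 2.0, 40.0])
X = rng.multivariate_normal(mean=[10, 10, 10], cov=cov, size=m).T   # n x m, columns are tuples

M_T, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthogonal matrix (stand-in for M_T)
v_T = np.array([[50.0], [-30.0], [20.0]])         # fixed translation (arbitrary illustrative values)
Y = M_T @ X + v_T                                 # distance-preserving perturbation

def pair_differences(A):
    """Column i minus column (half + i): differences over disjoint pairs of columns."""
    half = A.shape[1] // 2
    return A[:, :half] - A[:, half:2 * half]

def min_eigen_ratio(C):
    eig = np.sort(np.linalg.eigvalsh(C))[::-1]
    return float(np.min(eig[:-1] / eig[1:]))

X_star, Y_star = pair_differences(X), pair_differences(Y)

# The translation cancels, so the difference tuples are related by M_T alone.
translation_cancelled = bool(np.allclose(Y_star, M_T @ X_star))

ratio_original = min_eigen_ratio(np.cov(X))
ratio_differences = min_eigen_ratio(np.cov(X_star))
scale = float(np.trace(np.cov(X_star)) / np.trace(np.cov(X)))   # expected to be close to 2
print(translation_cancelled, ratio_original, ratio_differences, scale)
```

Both estimated ratios should be close to the population value of 20, up to sampling noise.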



G. Experiments - General Distance-Preserving Data Perturbation 

In this section, we evaluate the performance of the PCA attack on general distance-preserving 
data perturbation. We produce the translation vector with Matlab's random number generator. 
Once generated, the translation is fixed for the same private data for all the experiments. The 
translation vector is set sufficiently large to distinguish general distance-preserving perturbation 
from orthogonal transformation. 

We first experiment with the same Gaussian data (with a non-zero mean and a high minimum
eigen-ratio) we used in Section IV-C. As expected, the attack achieves a very low frequency of
privacy breach regardless of the sample ratio (see Figure 10). Next, we generate data from
a p.d.f. which is a non-symmetric mixture of two Gaussians with $\mu_1 = (10, 10, 10)$,
$\mu_2 = (20, 30, 40)$, and

$$\Sigma_1 = \begin{pmatrix} 1 & 1.5 & 0.5 \\ 1.5 & 3 & 2.5 \\ 0.5 & 2.5 & 75 \end{pmatrix}, \qquad \Sigma_2 = \begin{pmatrix} 0.1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 40 \end{pmatrix}.$$

The mixture weight for the first Gaussian is 0.2 and the weight for the second is 0.8. Note that this
mixture p.d.f. has a high minimum eigen-ratio, thereby reducing the effect of this factor in the
experiment. The results are depicted in Figure 11. Here we see that the attack works significantly
better. We believe this is because the asymmetry of the Gaussian mixture allows problems with
invariance to be better avoided.

Fig. 10. PCA-based attack for three-dimensional Gaussian data, general distance-preserving perturbation.

Fig. 11. PCA-based attack for three-dimensional Gaussian mixtures, general distance-preserving perturbation.

We also conducted experiments with the real Adult dataset and found the frequency of
privacy breach to degrade by approximately 10% as compared to the Adult dataset with only an
orthogonal perturbation as discussed in Section IV-C. We believe this is due to the fact that the
underlying generation mechanism for the three attributes of the Adult dataset that we consider
is sufficiently close to a multivariate Gaussian to cause attack problems due to invariance. This
claim is based on observing a visualization of the dataset.

H. Analysis Over MED and Cos Privacy Breach Definitions 

In the case of orthogonal data perturbation or the case of arbitrary Euclidean distance preserving
perturbation, the PCA-based attack does not depend upon the definition of privacy breach.
Of course, the empirical analysis does depend upon the privacy breach definition. For brevity,
we leave to future work the empirical analysis of the known sample attack with respect to the other
definitions, $\epsilon$-MED-privacy-breach and $\epsilon$-cos-privacy-breach.

VI. Discussion: Vulnerabilities and a Possible Remedy

When considering known input prior knowledge, dimensionality significantly affects the vul- 
nerability of the data to breach. The larger the difference between the number of linearly 
independent known inputs and the dimensionality, the lower the vulnerability of the data to 
breach. 

When considering known sample prior knowledge, our results point out three factors which 
affect the vulnerability of the data to breach. 1) Dimensionality: our approach has time complexity 
exponential in the number of data attributes. Hence, breaching medium to high dimensional data 
is infeasible. 2) Eigenvalue distinction: the quality of the attacker's estimate depends upon the 
size of the separation between the eigenvalues of the underlying covariance matrix. The smaller 
the separation, the lower the quality of the attacker's estimate. 3) Underlying p.d.f. symmetry 
(invariance): the quality of the attacker's estimate depends upon the symmetries present in the 
data generation p.d.f. If certain types of symmetries are present (called invariances earlier), the 
quality of the attack is low. For orthogonal transformations plus translations, the situation is even 
more difficult for the attacker as these symmetries only need be present after shifting the data 
to have zero mean. 

We conclude the paper by pointing out a potential remedy to the privacy problems described
earlier for the known sample attack. The data owner generates $\hat{R}$, an $\ell \times n$ matrix with each entry
sampled independently from a distribution with mean zero and variance one, and releases $Y = RX$
where $R = \ell^{-1/2}\hat{R}$ (this type of data perturbation for $\ell < n$ was discussed in [30]). It can be
shown that matrix $R$ is orthogonal on expectation and the probability of orthogonality approaches
one exponentially fast with $\ell$. By increasing $\ell$, the data owner can guarantee that distances
are preserved with arbitrarily high probability. Moreover, it can be shown that the randomness
introduced by $R$ kills the covariance in $Y$ used by the known sample attack. Specifically, given
random vector $\mathcal{V}$, it can be shown that $\Sigma_{R\mathcal{V}}$ (the covariance matrix of $R\mathcal{V}$) equals a constant
multiple $\gamma$ of the identity matrix. Therefore, the separation between the eigenvalues is zero, so, as
mentioned above, the known sample attack fails.
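
The distance-preservation claim for the scaled random matrix can be illustrated numerically. The following is a minimal sketch, assuming NumPy and assuming standard normal entries (the text only requires mean zero and variance one); the variable names and the choice of 2000 rows are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, ell = 3, 30, 2000                       # ell rows in the random matrix, much larger than n
X = rng.normal(loc=10.0, scale=5.0, size=(n, m))

R_hat = rng.normal(size=(ell, n))             # i.i.d. entries, mean 0 and variance 1 (normal assumed)
R = R_hat / np.sqrt(ell)                      # scaled matrix: ell^{-1/2} times the raw matrix
Y = R @ X                                     # released data

def pairwise_distances(A):
    """Euclidean distances between all pairs of columns of A."""
    diff = A[:, :, None] - A[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

d_x, d_y = pairwise_distances(X), pairwise_distances(Y)
mask = ~np.eye(m, dtype=bool)                 # ignore the zero diagonal
max_rel_error = float(np.max(np.abs(d_y[mask] - d_x[mask]) / d_x[mask]))
gram_error = float(np.max(np.abs(R.T @ R - np.eye(n))))   # R'R should be close to the identity
print(max_rel_error, gram_error)
```

Increasing `ell` shrinks both error measures, which matches the statement that distances are preserved with arbitrarily high probability as the number of rows grows.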

With respect to the known input attack, the $RX$ perturbation is potentially vulnerable. We have
begun investigating a maximum-likelihood attack technique (see [19] for a summary). Further
investigation of $RX$ perturbation is left to future work.

Appendix I
Supplementary Material

A. Known Sample Attack Proofs 

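Theorem 5.1: The eigenvalues of $\Sigma_{\mathcal{V}}$ and $\Sigma_{(M_T\mathcal{V}+v_T)}$ are the same. Let $Z(M_T\mathcal{V} + v_T)_i$ denote
the normalized eigenspace of $\Sigma_{(M_T\mathcal{V}+v_T)}$ associated with $\lambda_i$, i.e. the set of vectors $w \in \mathbb{R}^n$ such
that $\Sigma_{(M_T\mathcal{V}+v_T)}w = w\lambda_i$ and $\|w\| = 1$. It follows that $M_TZ(\mathcal{V})_i = Z(M_T\mathcal{V} + v_T)_i$, where
$M_TZ(\mathcal{V})_i$ equals $\{M_Tz : z \in Z(\mathcal{V})_i\}$.

Proof: First we derive an expression relating $\Sigma_{\mathcal{V}}$ and $\Sigma_{(M_T\mathcal{V}+v_T)}$:

$$\Sigma_{(M_T\mathcal{V}+v_T)} = E\left[(M_T\mathcal{V} + v_T - E[M_T\mathcal{V} + v_T])(M_T\mathcal{V} + v_T - E[M_T\mathcal{V} + v_T])'\right] = M_T\Sigma_{\mathcal{V}}M_T'.$$

Now consider any non-zero $\lambda \in \mathbb{R}$, and observe that $\det(\Sigma_{\mathcal{V}} - I_n\lambda) = \det(M_T\Sigma_{\mathcal{V}}M_T' - I_n\lambda)
= \det(\Sigma_{(M_T\mathcal{V}+v_T)} - I_n\lambda)$. Therefore $\lambda$ is an eigenvalue of $\Sigma_{\mathcal{V}}$ if and only if $\lambda$ is an eigenvalue
of $\Sigma_{(M_T\mathcal{V}+v_T)}$. Finally, consider any non-zero $w \in \mathbb{R}^n$. We have that

$$[w \in Z(M_T\mathcal{V} + v_T)_i] \Leftrightarrow [\Sigma_{(M_T\mathcal{V}+v_T)}w = w\lambda_i \text{ and } \|w\| = 1] \Leftrightarrow [M_T\Sigma_{\mathcal{V}}M_T'w = w\lambda_i \text{ and } \|w\| = 1]$$
$$\Leftrightarrow [\Sigma_{\mathcal{V}}(M_T'w) = (M_T'w)\lambda_i \text{ and } \|M_T'w\| = 1] \Leftrightarrow [M_T'w \in Z(\mathcal{V})_i] \Leftrightarrow [w \in M_TZ(\mathcal{V})_i]. \quad \blacksquare$$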

Corollary 5.2: Let $\mathcal{I}_n$ be the space of all $n \times n$ matrices with each diagonal entry $\pm 1$ and
each off-diagonal entry 0 ($2^n$ matrices in total). There exists $D_0 \in \mathcal{I}_n$ such that $M_T = WD_0Z'$.

Proof: Theorem 5.1 implies that for all $1 \leq i \leq n$, $M_Tz_i = w_i$ or $-M_Tz_i = w_i$. Therefore,
for some $D_0 \in \mathcal{I}_n$, $M_TZD_0 = W$. Because $D_0^{-1} = D_0$ and $Z$ is orthogonal the desired result
follows. ■
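
Theorem 5.1 and Corollary 5.2 can be sanity-checked on a small example. The sketch below assumes NumPy; the covariance with eigenvalues 1, 4 and 9 and the random orthogonal matrix are arbitrary illustrative choices standing in for $\Sigma_{\mathcal{V}}$ and $M_T$, and distinct eigenvalues are assumed so that eigenvectors are unique up to sign.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3

def random_orthogonal(d):
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return Q

Q = random_orthogonal(n)
cov_v = Q @ np.diag([1.0, 4.0, 9.0]) @ Q.T     # covariance with distinct eigenvalues (illustrative)
M_T = random_orthogonal(n)                     # stand-in for the hidden orthogonal matrix

cov_y = M_T @ cov_v @ M_T.T                    # covariance after perturbation; translation is irrelevant

lam_v, Z = np.linalg.eigh(cov_v)               # columns of Z: eigenvectors of cov_v
lam_y, W = np.linalg.eigh(cov_y)               # columns of W: eigenvectors of cov_y

same_eigenvalues = bool(np.allclose(lam_v, lam_y))

# Corollary 5.2: W' M_T Z should be a diagonal matrix with entries +1 or -1.
D0 = W.T @ M_T @ Z
is_sign_matrix = bool(np.allclose(np.abs(D0), np.eye(n), atol=1e-8))
reconstructed = bool(np.allclose(W @ np.round(D0) @ Z.T, M_T, atol=1e-8))
print(same_eigenvalues, is_sign_matrix, reconstructed)
```

The attacker does not know which sign matrix is the right one, which is why the PCA attack has to choose among the $2^n$ candidates.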

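Now we provide the derivation details of (9) under the assumption that $f$ is an $n$-variate
Gaussian distribution with mean vector $\mu_{\mathcal{V}} = a\mu$ for some $a \geq 0$ and invertible covariance
matrix $\Sigma_{\mathcal{V}}$.

First of all, for $n$-variate Gaussian distributions $g$ and $h$ with the same covariance matrix $\Sigma$
(assumed to be invertible) and mean vectors $\mu_g$ and $\mu_h$, we have

$$KL(g\|h) + KL(h\|g) = (\mu_g - \mu_h)'\Sigma^{-1}(\mu_g - \mu_h). \tag{11}$$

Second of all, for any $D$ in $\mathcal{I}_n$: (1) the covariance matrix of $f_D$ is $W\Lambda_{\mathcal{V}}W'$; (2) the mean
vector of $f_D$ is $WDZ'\mu_{\mathcal{V}}$; and (3) $f_D$ is multivariate Gaussian. Therefore, if $f$ is multivariate
Gaussian, then Equations (8) and (11) imply

$$\begin{aligned}
\mathrm{Inv}(f) &= \min_{D_i \neq D_0 \in \mathcal{I}_n} (WD_iZ'\mu_{\mathcal{V}} - WD_0Z'\mu_{\mathcal{V}})'(W\Lambda_{\mathcal{V}}W')^{-1}(WD_iZ'\mu_{\mathcal{V}} - WD_0Z'\mu_{\mathcal{V}}) \\
&= \min_{D_i \neq D_0 \in \mathcal{I}_n} \mu_{\mathcal{V}}'(ZD_iW' - ZD_0W')(W\Lambda_{\mathcal{V}}W')^{-1}(WD_iZ' - WD_0Z')\mu_{\mathcal{V}} \\
&= \min_{D_i \neq D_0 \in \mathcal{I}_n} \mu_{\mathcal{V}}'Z(D_i - D_0)\Lambda_{\mathcal{V}}^{-1}(D_i - D_0)Z'\mu_{\mathcal{V}} \\
&= a^2 \min_{D_i \neq D_0 \in \mathcal{I}_n} \left(\mu'Z(D_i - D_0)\Lambda_{\mathcal{V}}^{-1}(D_i - D_0)Z'\mu\right).
\end{aligned}$$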

B. Known Input Attack: Proof of Theorem 4.2 and MED/COS Privacy Breach Derivations

Theorem 4.2: Let $L$ be the mapping $P \in \mathbb{O}_{n-k} \mapsto M_TU_kU_k' + V_{n-k}PU_{n-k}'$. Then, $L$ is an affine
bijection from $\mathbb{O}_{n-k}$ to $\mathbb{M}(X_q, Y_q)$. And, $L^{-1}$ is the mapping $M \in \mathbb{M}(X_q, Y_q) \mapsto V_{n-k}'MU_{n-k}$.

To prove this theorem we rely upon the following key technical result.

Lemma 1.1: Let $\mathbb{P}$ denote the set $\{M_TU_kU_k' + V_{n-k}PU_{n-k}' : P \in \mathbb{O}_{n-k}\}$. Then $\mathbb{M}(X_q, Y_q) = \mathbb{P}$.

Proof: Let $\mathbb{M}(U_k, M_TU_k)$ denote the set of all $M \in \mathbb{O}_n$ such that $MU_k = M_TU_k$. First
we show that $\mathbb{M}(X_q, Y_q) = \mathbb{M}(U_k, M_TU_k)$. Since $Col(X_q) = Col(U_k)$, there exists a $k \times q$
matrix $A$ such that $U_kA = X_q$. Since $A$ has $k$ rows, $rank(A) \leq k$. Furthermore, [33,
pg. 201] implies that $k = rank(U_kA) \leq \min\{k, rank(A)\}$, thus $rank(A) = k$. Therefore, from
[33, pg. 90], $A$ has a right inverse.
For any $M \in \mathbb{O}_n$, we have

$$M \in \mathbb{M}(X_q, Y_q) \Leftrightarrow MU_kA = M_TU_kA \Leftrightarrow MU_k = M_TU_k.$$

The last $\Leftrightarrow$ follows from the fact that $A$ has a right inverse. We conclude that $\mathbb{M}(X_q, Y_q) =
\mathbb{M}(U_k, M_TU_k)$. Now we complete the proof by showing that $\mathbb{M}(U_k, M_TU_k) = \mathbb{P}$.

(1) For any $M \in \mathbb{P}$, there exists $P \in \mathbb{O}_{n-k}$ such that $M = M_TU_kU_k' + V_{n-k}PU_{n-k}'$. We
have then

$$MU_k = M_TU_kU_k'U_k + V_{n-k}PU_{n-k}'U_k = M_TU_k.$$

If we can show that $M$ is orthogonal, then $M \in \mathbb{M}(U_k, M_TU_k)$, so $\mathbb{P} \subseteq \mathbb{M}(U_k, M_TU_k)$, as
desired. Let $U$ denote $[U_k | U_{n-k}]$ (clearly $U \in \mathbb{O}_n$). Observe

$$\begin{aligned}
M'M &= U_kU_k'M_T'M_TU_kU_k' + U_kU_k'M_T'V_{n-k}PU_{n-k}' \\
&\quad + U_{n-k}P'V_{n-k}'M_TU_kU_k' + U_{n-k}P'V_{n-k}'V_{n-k}PU_{n-k}' \\
&= U_kU_k' + 0 + 0 + U_{n-k}U_{n-k}' \\
&= UU' = I_n,
\end{aligned}$$

where the first zero in the second equality is due to the fact that $Col(M_TU_k) = Col(Y_q)$, so
$V_{n-k}'M_TU_k = 0$.

(2) Now consider $M \in \mathbb{M}(U_k, M_TU_k)$. It can be shown that $Col(V_{n-k}) = Col(MU_{n-k})$ (see
footnote 21). Thus, there exists an $(n-k) \times (n-k)$ matrix $P$ with $V_{n-k}P = MU_{n-k}$. Observe that

$$P'P = P'(V_{n-k}'V_{n-k})P = (V_{n-k}P)'(V_{n-k}P) = (MU_{n-k})'(MU_{n-k}) = I_{n-k}.$$

Thus, $P \in \mathbb{O}_{n-k}$. Moreover,

$$MU = M[U_k | U_{n-k}] = [MU_k | MU_{n-k}] = [M_TU_k | V_{n-k}P].$$

Thus,

$$M = [M_TU_k | V_{n-k}P]\begin{bmatrix} U_k' \\ U_{n-k}' \end{bmatrix} = M_TU_kU_k' + V_{n-k}PU_{n-k}'.$$

Therefore, $M \in \mathbb{P}$, so $\mathbb{M}(U_k, M_TU_k) \subseteq \mathbb{P}$, as desired. ■

Footnote 21: Since $(MU_{n-k})'MU_k = 0$, $Col(MU_{n-k}) = Col_\perp(MU_k)$. Since $MU_k = M_TU_k$ and
$Col(M_TU_k) = Col(Y_q)$, it follows that $Col_\perp(MU_k) = Col_\perp(M_TU_k) = Col_\perp(Y_q) = Col(V_{n-k})$.

Now we prove Theorem 4.2.

Proof: Clearly $L$ is an affine map. Moreover, Lemma 1.1 directly implies that $L$ maps
$\mathbb{O}_{n-k}$ onto $\mathbb{M}(X_q, Y_q)$. To see that $L$ is one-to-one, consider $P_1, P_2 \in \mathbb{O}_{n-k}$ such that $L(P_1) =
L(P_2)$. By definition, $M_TU_kU_k' + V_{n-k}P_1U_{n-k}' = M_TU_kU_k' + V_{n-k}P_2U_{n-k}'$, thus $V_{n-k}P_1U_{n-k}' =
V_{n-k}P_2U_{n-k}'$. Therefore $P_1 = V_{n-k}'V_{n-k}P_1U_{n-k}'U_{n-k} = V_{n-k}'V_{n-k}P_2U_{n-k}'U_{n-k} = P_2$.

To complete the proof, consider $P \in \mathbb{O}_{n-k}$. We have $V_{n-k}'L(P)U_{n-k} = V_{n-k}'M_TU_kU_k'U_{n-k} +
V_{n-k}'V_{n-k}PU_{n-k}'U_{n-k} = 0 + P$. Moreover, consider $M \in \mathbb{M}(X_q, Y_q)$. By Lemma 1.1 there exists
$P_M \in \mathbb{O}_{n-k}$ such that $M = M_TU_kU_k' + V_{n-k}P_MU_{n-k}'$. We have $L(V_{n-k}'MU_{n-k}) = L(P_M) =
M$. Therefore, the inverse of $L$ is $M \in \mathbb{M}(X_q, Y_q) \mapsto V_{n-k}'MU_{n-k}$. ■
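
The mapping $L$ of Theorem 4.2 can be exercised numerically. The following sketch assumes NumPy and reads $\mathbb{M}(X_q, Y_q)$ as the set of orthogonal matrices that map the known inputs to their perturbed counterparts; the dimensions, the random matrices, and the helper names `random_orthogonal` and `split_basis` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, q = 5, 2                                        # dimension and number of known inputs

def random_orthogonal(d):
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return Q

def split_basis(A, k):
    """Orthonormal bases for Col(A) (first k columns) and for its orthogonal complement."""
    U, _, _ = np.linalg.svd(A, full_matrices=True)
    return U[:, :k], U[:, k:]

M_T = random_orthogonal(n)                         # stand-in for the hidden perturbation matrix
X_q = rng.normal(size=(n, q))                      # known inputs, linearly independent here
Y_q = M_T @ X_q                                    # their perturbed counterparts

k = int(np.linalg.matrix_rank(X_q))
U_k, U_rest = split_basis(X_q, k)                  # U_k and U_{n-k}
_, V_rest = split_basis(Y_q, k)                    # V_{n-k}: basis of the complement of Col(Y_q)

P = random_orthogonal(n - k)
M = M_T @ U_k @ U_k.T + V_rest @ P @ U_rest.T      # L(P)

is_orthogonal = bool(np.allclose(M.T @ M, np.eye(n)))
maps_known_inputs = bool(np.allclose(M @ X_q, Y_q))
inverse_recovers_P = bool(np.allclose(V_rest.T @ M @ U_rest, P))
print(is_orthogonal, maps_known_inputs, inverse_recovers_P)
```

Every choice of $P$ yields a different orthogonal matrix consistent with the known input-output pairs, which is why the attacker's uncertainty is parameterized by $\mathbb{O}_{n-k}$.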

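Now we provide the details of the results crucial to establishing the connections between
an $\epsilon$-privacy-breach and an $\epsilon$-MED-privacy-breach or $\epsilon$-cos-privacy-breach. First we show that:

$$1 - \cos(\hat{x}, x_{\hat{j}}) = \frac{\|\hat{x} - x_{\hat{j}}\|^2}{2\|x_{\hat{j}}\|^2}.$$

From the definition of $\hat{x}$ in the known input attack and the discussion immediately above it, we
have $\hat{x} = \hat{M}'y_{\hat{j}} = \hat{M}'M_Tx_{\hat{j}}$ and thus $\|\hat{x}\| = \|x_{\hat{j}}\|$. It follows that

$$\frac{\|\hat{x} - x_{\hat{j}}\|^2}{2\|x_{\hat{j}}\|^2} = \frac{2\|x_{\hat{j}}\|^2 - 2\hat{x}'x_{\hat{j}}}{2\|x_{\hat{j}}\|^2} = 1 - \frac{\hat{x}'x_{\hat{j}}}{\|\hat{x}\|\,\|x_{\hat{j}}\|} = 1 - \cos(\hat{x}, x_{\hat{j}}).$$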
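
The identity relating the cosine to the squared relative error holds whenever the estimate and the true tuple have equal norms. A minimal numerical check, assuming NumPy and using a random orthogonal matrix as a stand-in for the composite map applied to the private tuple, is sketched below.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
x = rng.normal(size=n)                             # plays the role of the private tuple
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))       # orthogonal, so Q @ x has the same norm as x
x_hat = Q @ x                                      # plays the role of the attacker's estimate

cosine = float(x_hat @ x / (np.linalg.norm(x_hat) * np.linalg.norm(x)))
lhs = 1.0 - cosine
rhs = float(np.linalg.norm(x_hat - x) ** 2 / (2.0 * np.linalg.norm(x) ** 2))
print(lhs, rhs)
```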

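Now we show that: $\min_{i=1}^{n}\{NAD(x_{\hat{j},i}, \hat{x}_i)\} \leq \frac{\|\hat{x} - x_{\hat{j}}\|}{\|x_{\hat{j}}\|}$. Let

$$i(\min) = \arg\min_{i=1}^{n}\{NAD(x_{\hat{j},i}, \hat{x}_i)\}.$$

Without loss of generality, assume that $\forall 1 \leq i \leq \ell$, $x_{\hat{j},i} \neq 0$ and $\forall \ell + 1 \leq i \leq n$, $x_{\hat{j},i} = 0$. We
have:

$$\begin{aligned}
\frac{\|\hat{x} - x_{\hat{j}}\|^2}{\|x_{\hat{j}}\|^2}
&= \frac{\sum_{i=1}^{\ell}(x_{\hat{j},i})^2\,NAD(x_{\hat{j},i}, \hat{x}_i)^2 + \sum_{i=\ell+1}^{n}NAD(x_{\hat{j},i}, \hat{x}_i)^2}{\sum_{i=1}^{\ell}(x_{\hat{j},i})^2} \\
&\geq \frac{\sum_{i=1}^{\ell}(x_{\hat{j},i})^2\,NAD(x_{\hat{j},i(\min)}, \hat{x}_{i(\min)})^2 + \sum_{i=\ell+1}^{n}NAD(x_{\hat{j},i(\min)}, \hat{x}_{i(\min)})^2}{\sum_{i=1}^{\ell}(x_{\hat{j},i})^2} \\
&\geq NAD(x_{\hat{j},i(\min)}, \hat{x}_{i(\min)})^2.
\end{aligned}$$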

C. Known Input Attack: A Rigorous Development of the Closed-Form Expression for $\rho(x_{\hat{j}}, \epsilon)$

In the main text, we had derived the following result (for $P$ chosen uniformly from $\mathbb{O}_{n-k}$):

$$\rho(x_{\hat{j}}, \epsilon) = \Pr\left(\|P'B'(U_{n-k}'x_{\hat{j}}) - (U_{n-k}'x_{\hat{j}})\| \leq \|x_{\hat{j}}\|\epsilon\right), \tag{12}$$

where $B \in \mathbb{O}_{n-k}$ and satisfies $M_TU_{n-k}B = V_{n-k}$. Now we provide a rigorous proof that
the r.h.s. above equals $\Pr(\|P'(U_{n-k}'x_{\hat{j}}) - (U_{n-k}'x_{\hat{j}})\| \leq \|x_{\hat{j}}\|\epsilon)$. To do so, we need some
material from measure theory.

Because $\mathbb{O}_{n-k}$ is a locally compact topological group [32, pg. 293], it has a Haar probability
measure, denoted by $\mu$, over $\mathcal{B}$, the Borel algebra on $\mathbb{O}_{n-k}$. This is commonly regarded as
the standard uniform probability measure over $\mathbb{O}_{n-k}$. Its key property is left-invariance: for all
$\mathbb{B} \in \mathcal{B}$ and all $M \in \mathbb{O}_{n-k}$, $\mu(\mathbb{B}) = \mu(M\mathbb{B})$, i.e., shifting $\mathbb{B}$ by a rigid motion does not change
its probability assignment.

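Let $\mathbb{O}_{n-k}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)$ denote the set of all $P \in \mathbb{O}_{n-k}$ such that
$\|P'(U_{n-k}'x_{\hat{j}}) - (U_{n-k}'x_{\hat{j}})\| \leq \|x_{\hat{j}}\|\epsilon$. Let $\mathbb{O}_{n-k}^{B}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)$ denote the set of all
$P \in \mathbb{O}_{n-k}$ such that $\|P'B'(U_{n-k}'x_{\hat{j}}) - (U_{n-k}'x_{\hat{j}})\| \leq \|x_{\hat{j}}\|\epsilon$ (see footnote 22). By definition
of $\mu$ we have

$$\begin{aligned}
\mu(\mathbb{O}_{n-k}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)) &= \Pr(P \text{ uniformly chosen from } \mathbb{O}_{n-k} \text{ lies in } \mathbb{O}_{n-k}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)) \\
&= \Pr(\|P'(U_{n-k}'x_{\hat{j}}) - (U_{n-k}'x_{\hat{j}})\| \leq \|x_{\hat{j}}\|\epsilon),
\end{aligned}$$

and

$$\begin{aligned}
\mu(\mathbb{O}_{n-k}^{B}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)) &= \Pr(P \text{ uniformly chosen from } \mathbb{O}_{n-k} \text{ lies in } \mathbb{O}_{n-k}^{B}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)) \\
&= \Pr(\|P'B'(U_{n-k}'x_{\hat{j}}) - (U_{n-k}'x_{\hat{j}})\| \leq \|x_{\hat{j}}\|\epsilon).
\end{aligned}$$

Therefore,

$$\begin{aligned}
\Pr(\|P'B'(U_{n-k}'x_{\hat{j}}) - (U_{n-k}'x_{\hat{j}})\| \leq \|x_{\hat{j}}\|\epsilon)
&= \mu(\mathbb{O}_{n-k}^{B}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)) \\
&= \mu(B\,\mathbb{O}_{n-k}^{B}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)) \\
&= \mu(\mathbb{O}_{n-k}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)) \qquad (13) \\
&= \Pr(\|P'(U_{n-k}'x_{\hat{j}}) - (U_{n-k}'x_{\hat{j}})\| \leq \|x_{\hat{j}}\|\epsilon),
\end{aligned}$$

where the second equality is due to the left-invariance of $\mu$ and the third equality is due to the
fact that $B\,\mathbb{O}_{n-k}^{B}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)$ can be shown to equal $\mathbb{O}_{n-k}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)$.

Footnote 22: Since $\mathbb{O}_{n-k}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)$ and $\mathbb{O}_{n-k}^{B}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)$ are topologically closed sets,
they are Borel subsets of $\mathbb{O}_{n-k}$; therefore, $\mu$ is defined on each of these.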

Since the last equality above was for intuitive purposes only, we will ignore it in completing
the derivation of a closed form expression. (12) and (13) imply

$$\rho(x_{\hat{j}}, \epsilon) = \mu(\mathbb{O}_{n-k}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)).$$

Recall that $\mathbb{S}_{n-k}(\|U_{n-k}'x_{\hat{j}}\|)$ denotes the hyper-sphere in $\mathbb{R}^{n-k}$ with radius $\|U_{n-k}'x_{\hat{j}}\|$ and
centered at the origin, and $\mathbb{S}_{n-k}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)$ denotes the points contained by $\mathbb{S}_{n-k}(\|U_{n-k}'x_{\hat{j}}\|)$
whose distance from $U_{n-k}'x_{\hat{j}}$ is no greater than $\|x_{\hat{j}}\|\epsilon$. Using basic principles from measure
theory, it can be shown that (see footnote 23)

$$\mu(\mathbb{O}_{n-k}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)) = \frac{SA(\mathbb{S}_{n-k}(U_{n-k}'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon))}{SA(\mathbb{S}_{n-k}(\|U_{n-k}'x_{\hat{j}}\|))}.$$

We have arrived at the surface area ratio expression from Section IV-D. Next, we derive the desired
closed-form expression. To simplify exposition, we prove the following result for $m \geq 0$, $z \in \mathbb{R}^m$, and
$c > 0$ (by plugging in $m = n - k$, $z = U_{n-k}'x_{\hat{j}}$, and $c = \|x_{\hat{j}}\|\epsilon$, the closed-form expression follows).



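$$\frac{SA(\mathbb{S}_m(z,c))}{SA(\mathbb{S}_m(\|z\|))} =
\begin{cases}
1 & \text{if } m = 0; \\
1 & \text{if } c \geq \|z\|2 \text{ and } m \geq 1; \\
0.5 & \text{if } c < \|z\|2 \text{ and } m = 1; \\
1 - (1/\pi)\arccos\left([c/(\|z\|\sqrt{2})]^2 - 1\right) & \text{if } \|z\|\sqrt{2} < c < \|z\|2 \text{ and } m = 2; \\
1 - \frac{(m-1)\Gamma([m+2]/2)}{m\sqrt{\pi}\,\Gamma([m+1]/2)} \int_{\theta_1=0}^{\arccos([c/(\|z\|\sqrt{2})]^2 - 1)} \sin^{m-1}(\theta_1)\,d\theta_1 & \text{if } \|z\|\sqrt{2} < c < \|z\|2 \text{ and } m \geq 3; \\
(1/\pi)\arccos\left(1 - [c/(\|z\|\sqrt{2})]^2\right) & \text{if } c \leq \|z\|\sqrt{2} \text{ and } m = 2; \\
\frac{(m-1)\Gamma([m+2]/2)}{m\sqrt{\pi}\,\Gamma([m+1]/2)} \int_{\theta_1=0}^{\arccos(1 - [c/(\|z\|\sqrt{2})]^2)} \sin^{m-1}(\theta_1)\,d\theta_1 & \text{if } c \leq \|z\|\sqrt{2} \text{ and } m \geq 3.
\end{cases} \tag{14}$$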



Footnote 23: $\mathbb{S}_1(\|U_1'x_{\hat{j}}\|)$ consists of two points. Recall that we define
$SA(\mathbb{S}_1(U_1'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)) / SA(\mathbb{S}_1(\|U_1'x_{\hat{j}}\|))$ as 0.5 if $\mathbb{S}_1(U_1'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)$ is one point, and
as 1 otherwise. Moreover, we define $SA(\mathbb{S}_0(U_0'x_{\hat{j}}, \|x_{\hat{j}}\|\epsilon)) / SA(\mathbb{S}_0(\|U_0'x_{\hat{j}}\|))$ as 1.

Fig. 12. The hyper-sphere $\mathbb{S}_m(\|z\|)$ and two "north pole" caps ($c \leq \|z\|\sqrt{2}$).

Fig. 13. The hyper-sphere $\mathbb{S}_m(\|z\|)$ and one "south pole" cap ($\|z\|\sqrt{2} < c < \|z\|2$).


Before proving (14) we establish:

$$\text{For } b \geq 2 \text{ and } r > 0, \quad SA(\mathbb{S}_b(r)) = \frac{b\pi^{b/2}r^{b-1}}{\Gamma((b+2)/2)}. \tag{15}$$

Indeed, with $Vol(\cdot)$ denoting volume, it can be shown that
$SA(\mathbb{S}_b(r)) = \frac{d\,Vol(\mathbb{S}_b(r))}{dr} = Vol(\mathbb{S}_b(1))\frac{d\,r^b}{dr} = \frac{b\pi^{b/2}r^{b-1}}{\Gamma((b+2)/2)}$.
The last equality follows from [36]. Now we return to proving (14).

If $m = 0$, then the surface area ratio equals 1 by definition. If $c \geq \|z\|2$ and $m \geq 1$, then
the ratio equals 1 since $\mathbb{S}_m(z,c) = \mathbb{S}_m(\|z\|)$. If $c < \|z\|2$ and $m = 1$, then the ratio equals 0.5
since $\mathbb{S}_1(z,c) = \{z\}$ and $\mathbb{S}_1(\|z\|) = \{z, -z\}$. For the remainder of the derivation, we assume
that $m \geq 2$ and, without loss of generality, $z$ is at the "north pole" of the hyper-sphere $\mathbb{S}_m(\|z\|)$,
i.e. $z = (\|z\|, 0, 0, \cdots, 0)$.

Case $c \leq \|z\|\sqrt{2}$: The set of points on $\mathbb{S}_m(\|z\|)$ whose distance from $z$ equals $c$ is the
intersection of $\mathbb{S}_m(\|z\|)$ with the hyper-plane whose perpendicular to $z$ is of length $h$ as seen
in Figure 12. Thus, $\mathbb{S}_m(z,c)$ are all those points on $\mathbb{S}_m(\|z\|)$ not below that hyper-plane.

Sub-case $m = 2$: Since $\mathbb{S}_2(\|z\|)$ is an ordinary circle, the angle $\theta$ in Figure 12 determines
the surface area ratio as follows: $\frac{SA(\mathbb{S}_2(z,c))}{SA(\mathbb{S}_2(\|z\|))} = \frac{2\theta\|z\|}{2\pi\|z\|} = \frac{\theta}{\pi}$. Moreover, since $\theta$ is the top angle of
an isosceles triangle with sides of length $\|z\|$ and base of length $c$, $\sin(\theta/2) = c/(2\|z\|)$.
The half-angle formula implies that $\theta = \arccos(1 - [c/(\|z\|\sqrt{2})]^2)$. Therefore, as desired,

$$\frac{SA(\mathbb{S}_2(z,c))}{SA(\mathbb{S}_2(\|z\|))} = (1/\pi)\arccos\left(1 - [c/(\|z\|\sqrt{2})]^2\right). \tag{16}$$
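Sub-case $m \geq 3$: Here, computing the surface area ratio is more complicated and requires an
appeal to the integral definition of the cap surface area. Consider the intersection of $\mathbb{S}_m(\|z\|)$
with the hyper-plane whose perpendicular to $z$ is of length $0 \leq h_1 \leq h$ as seen in Figure 12.
The surface area of this intersection equals the surface area of $\mathbb{S}_{m-1}(r(h_1))$. Thus, (15) implies

$$SA(\mathbb{S}_m(z,c)) = \int_{h_1=0}^{h} SA(\mathbb{S}_{m-1}(r(h_1)))\,dh_1 = \frac{(m-1)\pi^{(m-1)/2}}{\Gamma((m+1)/2)} \int_{h_1=0}^{h} (r(h_1))^{m-2}\,dh_1.$$

To evaluate the integral, we change coordinates with $h_1 = \|z\|(1 - \cos(\theta_1))$. So, $h_1 = 0, h$ implies
that $\theta_1 = 0, \arccos(1 - h/\|z\|)$. And, $r(\|z\|(1 - \cos(\theta_1))) = \|z\|\sin(\theta_1) = \frac{dh_1}{d\theta_1}$. Therefore,

$$\begin{aligned}
\int_{h_1=0}^{h} (r(h_1))^{m-2}\,dh_1
&= \int_{\theta_1=0}^{\arccos(1 - h/\|z\|)} r(\|z\|(1 - \cos(\theta_1)))^{m-2}\,\frac{dh_1}{d\theta_1}\,d\theta_1 \\
&= \int_{\theta_1=0}^{\arccos(1 - h/\|z\|)} \|z\|^{m-2}\sin^{m-2}(\theta_1)\,\|z\|\sin(\theta_1)\,d\theta_1 \\
&= \|z\|^{m-1}\int_{\theta_1=0}^{\arccos(1 - h/\|z\|)} \sin^{m-1}(\theta_1)\,d\theta_1.
\end{aligned}$$

Plugging this into the previous equations for $SA(\mathbb{S}_m(z,c))$ and using (15), we get

$$\begin{aligned}
\frac{SA(\mathbb{S}_m(z,c))}{SA(\mathbb{S}_m(\|z\|))}
&= \left(\frac{(m-1)\pi^{(m-1)/2}}{\Gamma((m+1)/2)}\right)\left(\frac{\Gamma((m+2)/2)}{m\pi^{m/2}\|z\|^{m-1}}\right)\|z\|^{m-1}\int_{\theta_1=0}^{\arccos(1 - h/\|z\|)} \sin^{m-1}(\theta_1)\,d\theta_1 \\
&= \left(\frac{(m-1)\Gamma((m+2)/2)}{\Gamma((m+1)/2)\,m\sqrt{\pi}}\right)\int_{\theta_1=0}^{\arccos(1 - h/\|z\|)} \sin^{m-1}(\theta_1)\,d\theta_1.
\end{aligned}$$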

Since $h = \frac{c^2}{2\|z\|}$, then, as desired, we get

$$\frac{SA(\mathbb{S}_m(z,c))}{SA(\mathbb{S}_m(\|z\|))} = \left(\frac{(m-1)\Gamma((m+2)/2)}{\Gamma((m+1)/2)\,m\sqrt{\pi}}\right)\int_{\theta_1=0}^{\arccos(1 - [c/(\|z\|\sqrt{2})]^2)} \sin^{m-1}(\theta_1)\,d\theta_1. \tag{17}$$

Case $\|z\|\sqrt{2} < c < \|z\|2$: As depicted in Figure 13, $\mathbb{S}_m(z,c)$ contains the entire northern
hemisphere of $\mathbb{S}_m(\|z\|)$. Let $\mathbb{S}_m(-z, c')$ denote the "south pole" cap defined by $h'$ (and $c'$) in
Figure 13 (clearly $c' \leq \|z\|\sqrt{2}$). We have

$$\frac{SA(\mathbb{S}_m(z,c))}{SA(\mathbb{S}_m(\|z\|))} = 1 - \frac{SA(\mathbb{S}_m(-z,c'))}{SA(\mathbb{S}_m(\|z\|))}. \tag{18}$$

By replacing "$c$" with "$c'$" in (16) and (17), then plugging the resulting expression into (18), we
get

$$\frac{SA(\mathbb{S}_m(z,c))}{SA(\mathbb{S}_m(\|z\|))} =
\begin{cases}
1 - (1/\pi)\arccos\left(1 - [c'/(\|z\|\sqrt{2})]^2\right) & \text{if } m = 2; \\
1 - \frac{(m-1)\Gamma((m+2)/2)}{\Gamma((m+1)/2)\,m\sqrt{\pi}}\int_{\theta_1=0}^{\arccos(1 - [c'/(\|z\|\sqrt{2})]^2)} \sin^{m-1}(\theta_1)\,d\theta_1 & \text{if } m \geq 3.
\end{cases} \tag{19}$$

From Figure 13 it can be seen that $\theta$ is the top angle of an isosceles triangle with sides of
length $\|z\|$ and base of length $c'$. So, $\sin(\theta/2) = \frac{c'}{2\|z\|}$. The half-angle formula implies $\cos(\theta) =
1 - [c'/(\|z\|\sqrt{2})]^2$. Similar reasoning shows $\cos(\pi - \theta) = 1 - [c/(\|z\|\sqrt{2})]^2$. Since $0 \leq \theta \leq \pi/2$,
$\cos(\pi - \theta) = -\cos(\theta)$. Thus, $[c/(\|z\|\sqrt{2})]^2 - 1 = 1 - [c'/(\|z\|\sqrt{2})]^2$. Plugging $2 - [c/(\|z\|\sqrt{2})]^2$
in for $[c'/(\|z\|\sqrt{2})]^2$ in (19) yields the desired results.

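D. Known Input Attack: Computing the Closed-Form Expression for $\rho(x_{\hat{j}}, \epsilon)$

Next we develop recursive procedures for computing the closed-form expression. This amounts to
computing the following two functions: (i) $GR(m) = \Gamma([m+2]/2)/\Gamma([m+1]/2)$ for $m \geq 1$; (ii) $SI(z,m) =
\int_{\theta_1=0}^{\arccos(z)} \sin^{m-1}(\theta_1)\,d\theta_1$ for $1 \geq z \geq 0$ and $m \geq 1$. Indeed, the closed-form expression is equivalent to

$$\rho(x_{\hat{j}}, \epsilon) =
\begin{cases}
1 & \text{if } n - k = 0; \\
1 & \text{if } \|y_{\hat{j}}\|\epsilon \geq \|V_{n-k}'y_{\hat{j}}\|2 \text{ and } n - k \geq 1; \\
0.5 & \text{if } \|y_{\hat{j}}\|\epsilon < \|V_{n-k}'y_{\hat{j}}\|2 \text{ and } n - k = 1; \\
1 - (1/\pi)\arccos\left(\left[\frac{\|y_{\hat{j}}\|\epsilon}{\|V_{n-k}'y_{\hat{j}}\|\sqrt{2}}\right]^2 - 1\right) & \text{if } \|V_{n-k}'y_{\hat{j}}\|\sqrt{2} < \|y_{\hat{j}}\|\epsilon < \|V_{n-k}'y_{\hat{j}}\|2 \text{ and } n - k = 2; \\
1 - \frac{(n-k-1)GR(n-k)}{(n-k)\sqrt{\pi}}\,SI\left(\left[\frac{\|y_{\hat{j}}\|\epsilon}{\|V_{n-k}'y_{\hat{j}}\|\sqrt{2}}\right]^2 - 1,\ n - k\right) & \text{if } \|V_{n-k}'y_{\hat{j}}\|\sqrt{2} < \|y_{\hat{j}}\|\epsilon < \|V_{n-k}'y_{\hat{j}}\|2 \text{ and } n - k \geq 3; \\
(1/\pi)\arccos\left(1 - \left[\frac{\|y_{\hat{j}}\|\epsilon}{\|V_{n-k}'y_{\hat{j}}\|\sqrt{2}}\right]^2\right) & \text{if } \|y_{\hat{j}}\|\epsilon \leq \|V_{n-k}'y_{\hat{j}}\|\sqrt{2} \text{ and } n - k = 2; \\
\frac{(n-k-1)GR(n-k)}{(n-k)\sqrt{\pi}}\,SI\left(1 - \left[\frac{\|y_{\hat{j}}\|\epsilon}{\|V_{n-k}'y_{\hat{j}}\|\sqrt{2}}\right]^2,\ n - k\right) & \text{if } \|y_{\hat{j}}\|\epsilon \leq \|V_{n-k}'y_{\hat{j}}\|\sqrt{2} \text{ and } n - k \geq 3.
\end{cases} \tag{20}$$

To compute $GR(m)$ for $m \geq 1$, we use the following facts: $\Gamma(z + 1) = z\Gamma(z)$ for $z > 0$,
$\Gamma(1/2) = \sqrt{\pi}$, and $\Gamma(1) = 1$. Thus, we get a recursive procedure for computing $GR(m)$:

$$GR(m) =
\begin{cases}
\frac{\sqrt{\pi}}{2} & \text{if } m = 1; \\
\frac{2}{\sqrt{\pi}} & \text{if } m = 2; \\
\left(\frac{m}{m-1}\right)GR(m-2) & \text{if } m \geq 3.
\end{cases} \tag{21}$$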



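To compute $SI(z,m)$ for $1 \geq z \geq 0$ and $m \geq 1$, we use the following facts. $\sin^{m-2}(\arccos(z)) =
[1 - z^2]^{(m-2)/2}$ if $m \geq 3$. And, $SI(z,m) = \left[\int \sin^{m-1}(\theta_1)\,d\theta_1\right](\arccos(z)) - \left[\int \sin^{m-1}(\theta_1)\,d\theta_1\right](0)$.
And,

$$\left[\int \sin^{m-1}(\theta_1)\,d\theta_1\right](w) =
\begin{cases}
w & \text{if } m - 1 = 0; \\
-\cos(w) & \text{if } m - 1 = 1; \\
-\frac{\sin^{m-2}(w)\cos(w)}{m-1} + \frac{m-2}{m-1}\left[\int \sin^{m-3}(\theta_1)\,d\theta_1\right](w) & \text{if } m - 1 \geq 2.
\end{cases} \tag{22}$$

Therefore,

$$SI(z,m) =
\begin{cases}
\arccos(z) & \text{if } m = 1; \\
1 - z & \text{if } m = 2; \\
-\frac{z[1 - z^2]^{(m-2)/2}}{m-1} + \frac{m-2}{m-1}SI(z, m-2) & \text{if } m \geq 3.
\end{cases} \tag{23}$$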
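
The two recursions (21) and (23) translate directly into code. The following is a minimal sketch using only the Python standard library; `GR` and `SI` follow the definitions in the text, while `SI_numeric` is a hypothetical cross-check based on the midpoint rule and is not part of the paper's procedure.

```python
import math

def GR(m):
    """Gamma([m+2]/2) / Gamma([m+1]/2), computed with recursion (21)."""
    if m == 1:
        return math.sqrt(math.pi) / 2
    if m == 2:
        return 2 / math.sqrt(math.pi)
    return m / (m - 1) * GR(m - 2)

def SI(z, m):
    """Integral of sin^(m-1) from 0 to arccos(z), computed with recursion (23)."""
    if m == 1:
        return math.acos(z)
    if m == 2:
        return 1 - z
    return (-z * (1 - z * z) ** ((m - 2) / 2) + (m - 2) * SI(z, m - 2)) / (m - 1)

def SI_numeric(z, m, steps=20000):
    """Midpoint-rule approximation of the same integral, used only as a cross-check."""
    h = math.acos(z) / steps
    return sum(math.sin((i + 0.5) * h) ** (m - 1) for i in range(steps)) * h

for m in range(1, 7):
    direct = math.gamma((m + 2) / 2) / math.gamma((m + 1) / 2)
    print(m, GR(m), direct, SI(0.3, m), SI_numeric(0.3, m))
```

Each recursion steps down by two, so the recursion depth grows only linearly in $m$ and no Gamma function evaluation is needed.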



E. Known Input Attack on General Distance-Preserving Data Perturbation

Previously, we had considered the case where the data perturbation is assumed to be orthogonal
(does not involve a fixed translation, $v_T = 0$). Now we briefly discuss how the attack technique
and its analysis can be extended to arbitrary Euclidean distance preserving perturbation ($v_T \neq 0$).

Extending the algorithms for inferring $\pi$: Since the length of the private data tuples may not
be preserved, the definition of validity in Section IV-A must be changed: $\alpha$ on $I$ is valid
if $\forall i, j \in I$, $\|x_i - x_j\| = \|y_{\alpha(i)} - y_{\alpha(j)}\|$. As well, the definition of $C(\alpha_1, i)$ (given $I_1 \subseteq I$, $\alpha_1$ a
valid assignment on $I_1$, and $i \in (I \setminus I_1)$) must change: the set of all $j \in (\{1, \ldots, m\} \setminus \alpha_1(I_1))$
such that for all $i_1 \in I_1$, $\|x_{i_1} - x_i\| = \|y_{\alpha_1(i_1)} - y_j\|$. With these changes, Algorithms IV-A.1,
IV-A.2, and IV-A.3 work correctly as stated.

Extending the known input attack: The basic idea is simple and relies upon the fact that the
same $v_T$ is added to all tuples in the perturbation of $X_q$. Fix one tuple, say $x_q$ and $y_q$, and
consider the following differences: $x_1^- = (x_q - x_1), \ldots, x_{q-1}^- = (x_q - x_{q-1})$ and $y_1^- = (y_q - y_1),
\ldots, y_{q-1}^- = (y_q - y_{q-1})$. Let $X_{q-1}^-$ denote the matrix with columns $x_1^-, \ldots, x_{q-1}^-$ and $Y_{q-1}^-$ denote
the matrix with columns $y_1^-, \ldots, y_{q-1}^-$. Observe that $Y_{q-1}^- = M_TX_{q-1}^-$; hence, the attack and its
analysis from the orthogonal data perturbation case can be applied. The details are straightforward
and are omitted for brevity. However, a caveat is in order. The attack depends upon the
choice of the tuple to fix. Therefore, the attacker examines them all and chooses the one giving the
highest privacy breach probability.